PREFACE


This volume is intended for two general classes of readers: 
first, the teacher wishing to make a serious study of the 
theory and practice of objective examining, and second, the 
student who is beginning his study of educational measurement.
The organization of the book has kept in mind the
fact that the general reader will not always be familiar with
statistical procedures. For this reason the more technical 
topics are reserved for the closing chapters. 

It is the author’s conviction based upon nearly ten years 
of experience with the construction, administration, and 
teaching of tests and measurements that the introduction 
to the theory and practice of educational measurements 
may best come from an initial study of informal classroom 
tests. The teacher and the student are already familiar 
with the usual school examinations. This background pro- 
vides a basis for the introduction of such concepts as validity, 
reliability, objectivity, sampling, errors of measurement, etc. 

After the basic concepts of educational measurement have
been mastered through the avenue of informal objective 
tests, the student is in a position to evaluate standard tests 
critically. 

This volume has been given the subtitle, An Introduction
to Educational Measurement, in the belief that such a treat- 
ment should precede and introduce formal study of stand- 
ardized measurements. 

Judging by the developments of the past five years, the
objective examination has come to stay for an indefinite 
period. It is almost certain to persist, in one form or 
another, just as the older forms of examinations will continue
in use, with important modifications. No one method
of examining is likely to prove adequate for the varied 
purposes of the teacher. The future will see the traditional
examination, the new-type test, and the standard test 
exist side by side. We may expect continual efforts at 
correlation and delimitation of these three types of measure- 
ments but not the elimination of any one. 

It is fairly certain that the objective examination will 
displace much of the testing now done by the traditional 
written test or essay examination, particularly in the 
measurement of information. It is still an open question 
whether the objective test will prove adequate for the 
measurement of appreciational skills. For the time being 
we must choose between the subjective question which is 
most difficult of evaluation and the objective test item whose 
validity is not entirely assured. 

There can be no ultimate conflict between the objective 
test and the standard test. As our skill in objective examin- 
ing of the informal type increases, there is certain to arise a 
more critical attitude toward standard tests. This will have 
far-reaching results as there are many present signs of storm 
clouds hovering over the educator who accepts the standard 
test on faith. The standard test has thus far taken ad- 
vantage of a “halo” arising from a largely mistaken belief 
that it represents a more “scientific” instrument than the 
classroom teacher can construct. This is not true in general. 
The best of existing standard tests do represent a degree of 
refinement not possible without extended experimentation. 
But the rank-and-file of such tests are readily equaled or 
bettered by the teacher who has mastered a little theory of 
measurement and who is seriously intent upon building valid 
and reliable examinations. 

The author has drawn freely upon his two earlier books 
on the same subject, particularly the Improvement of the 
Written Examination. Grateful acknowledgment is made to
Scott, Foresman and Company for this privilege. There
are a number of conclusions and recommendations in the 
present treatment which are in contradiction with previously 
published statements. The available experimental evidence 
in 1923-1924 when the Improvement of the Written Examina- 
tion was being written was almost negligible. The past five 
years have brought forward a very respectable mass of 
empirical findings. If the author in his earlier writings 
guessed, and guessed wrong at times, it may be charged to 
the general lack of knowledge at the time. 

The reader will undoubtedly be conscious of the fact that 
there is a considerable amount of repetition of certain ideas 
like the theory of sampling, need for reliability, measure- 
ment as ranking, etc. This was done deliberately under the 
feeling that such ideas are not familiar to most teachers and 
that repetition is one of the best means of emphasis. The 
author is quite aware that a more concise and logical, but 
less psychological, treatment might have been pursued. 

Dr. Noel Keys, Associate Professor of Education, Univer- 
sity of California, read all the manuscript and made scores 
of valuable suggestions. Dr. Hermann Remmers of Purdue 
University also read portions of the manuscript to its great 
improvement. The author’s greatest indebtedness is to 
Dr. Ben D. Wood of Columbia University. Although 
geographically far apart, Dr. Wood and the author have
been very close together in the directions which their studies 
have taken. At times they have quite independently arrived 
at the same conclusions. This has been a source of great 
satisfaction, particularly since Dr. Wood is recognized as 
perhaps the foremost investigator and advocate of the new-type
examination. Less directly the author is indebted to
former teachers of educational and mental measurement,
Doctors Lewis M. Terman and Truman L. Kelley of Stanford
University. The World Book Company, the Macmillan
Company, the Bureau of Public Personnel Administration,
and the State University of Iowa have permitted the quotation
or reprinting of certain of their copyrighted materials.
A great many public school teachers have contributed ob- 
jective examinations used for purposes of illustration. Spe- 
cific reference is given to these sources at the proper places. 
The General Bibliography at the end of the volume is the 
work of Mr. Sanford Siegrist, Mr. George Meyer, and the 
author. The numerous studies of the author’s graduate 
students are given recognition throughout the text. 

Miss Birdie Weisbrod and Mr. George Meyer read all 
the proof and checked many of the calculations. 

It is the hope of the author that a great many classroom 
teachers and beginning students in education will take the 
trouble to go through at least Parts I and II of this volume 
and accept or reject the various points of view. It is not 
to be expected that the author is invariably correct in his 
statements. The important thing is to have thought through 
the issues presented. 

G. M. R. 

Berkeley, California
January 5, 1929

THE ARGUMENT FOR OBJECTIVE
EXAMINATIONS 




CHAPTER I 


POINTS OF VIEW 

EARLY IDEAS ON EXAMINATIONS 

Historical and introductory. The past quarter of a 
century has witnessed the rise of educational measurement 
to the plane of conscious striving for objective, impartial, 
and comparative means for portraying the absolute and
relative achievements of pupils. Prior to the beginning of
the present century, teachers, although long familiar with
examinations, did not view their tests and examinations as
measurements in the present meaning of the word. Oral 
quizzing, Socratic or otherwise, had been from time imme- 
morial a part of the daily classroom routine; in fact, at times, 
it was all of teaching. Formal written examinations are 
probably more recent than oral testing, but these date their 
origins many centuries ago; certainly formal written examinations
were firmly intrenched in the educational system
of China thirteen hundred years ago, and were familiar to 
Grecian and Roman teachers. 

In America examinations appear to be as old as formal
education itself. Horace Mann, as early as 1845, formulated 
a clean-cut concept of the written examination and its 
superiority over such older methods as the oral quiz. 

The serious student of the art of examining will find a 
mine of delightful and valuable information about early 
examination methods in that most interesting volume by 
Professors Caldwell and Courtis entitled Then and Now in 
Education: 1845-1923.¹

¹ Published in 1924 by the World Book Co., Yonkers-on-Hudson, N. Y.

THE OBJECTIVE OR NEW-TYPE EXAMINATION

Horace Mann and the written examination. Writing in 
1845 Horace Mann argued that the new examination was 
superior to all other methods for the following reasons: 

1. It is impartial. 

2. It is just to the pupils. 

3. It is more thorough than older forms of examination. 

4. It prevents the “officious interference” of the teacher. 

5. It “determines, beyond appeal or gainsaying, whether the pupils 
have been faithfully and competently taught.” 

6. It takes away “all possibility of favoritism.” 

7. It makes the information obtained available to all. 

8. It enables all to appraise the ease or difficulty of the questions.¹

These arguments were advanced in justification of what 
was really the first American school survey, that of the 
Grammar and Writing Schools of Boston in 1845. Horace
Mann proceeds further to justify the “new” examination:

... it submits the same question not only to all the scholars who are
to be examined, in the same school, but to all schools of the same class or 
grade. Scholars in the same school, therefore, can be equitably compared 
with each other; and all the different schools are subjected to a measure- 
ment by the same standard. Take the best school committee-man who 
ever exposed the nakedness of ignorance, or detected fraud, or exploded the 
bubbles of pretension, and let him examine a class orally, and he cannot 
approach exactness in judging of the relative merits of the pupils by any 
very close approximation. And the reason is apparent. He must propound 
different questions to different scholars; and it is impossible that these 
questions should be equal, in point of ease or difficulty. A poor scholar
may be asked an easy question, and answer it. A good scholar may be
asked a very difficult one, and miss it. In some cases a succeeding scholar
may profit by the mistakes of a preceding one; so that, if there had been a 
different arrangement of their seats, the record would have borne a different 
result of plus and minus. The examiner may prepare himself as carefully 
as he pleases, and mark out the precise path he intends to pursue, and yet, 
in spite of himself, he may be thrown out of his path by unforeseen circumstances.
But when the questions are the same, there is exactness of
equality. Balances cannot weigh out the work more justly. So far as the
examination is concerned, all the scholars are “born free and equal.”

¹ Caldwell and Courtis, Then and Now in Education (Yonkers-on-Hudson, New
York: World Book Company, 1923), p. 37.

Suppose a race were to be run by twenty men in order to determine their
comparative fleetness; but instead of bringing them upon the same course,
where they could all stand abreast and start abreast, one of them should be 
selected to run one mile, and so on, until the whole had entered the lists; 
might it not, and would it not so happen that the one would have the luck 
of running up hill, and another down; that one would run over a good turn- 
pike and another over a “corduroy”? Pupils required to answer dissimilar 
questions are like runners obliged to test their speed by running on dis- 
similar courses. 

Again, it is clear that the larger the number of questions put to a scholar, 
the better is the opportunity to test his merits. If but a single question 
is put, the best scholar in the school may miss it, though he would answer 
the next twenty without a blunder; or the poorest scholar may succeed in 
answering one question, though certain to fail in twenty others. Each 
question is a partial test, and the greater the number of questions, therefore, 
the nearer does the test approach to completeness. It is very uncertain
which face of a die will be turned up at the first throw; but if the dice are
thrown all day, there will be a great equality in the number of faces turned
up.¹

Before commenting on the foregoing quotation, let us 
proceed further with the reproduction of these prophetic 
and discerning ideas of Horace Mann. 

Suppose, under the form of oral examination, an hour is assigned to a 
class of thirty pupils; this gives two minutes apiece. But under the late 
mode of examination (the uniform written examination), we have the 
paradox that an hour for thirty is sixty minutes apiece. Now it often 
happens that a sterling scholar is modest, diffident, and easily disconcerted 
under new circumstances. Such a pupil requires time to collect his faculties. 
Give him this, and he will not disappoint his best friends. Debar him from 
this, and a forth-putting, self-esteeming competitor may surpass him. In 
an exercise of two minutes, therefore, the best scholar may fail, because he 
loses his only opportunity while he is summoning his energies to improve 
it; but give him an hour, and he will have time to rally and do himself
justice. It is one of the principal recommendations of this method, indeed, 
that it excludes surprise as one of the causes of failure, and takes away the 
simulation of it as an excuse. 

And again: 

It sometimes happens that when an examiner has brought a class or a 
pupil to a test-question, to a point that will reveal their condition as to


¹ Loc. cit., pp. 39-40.

ignorance or knowledge, the teacher bolts out with some suggestion or
leading question that defeats the whole purpose at a breath.¹

Comment on the views of Horace Mann. It must be 
disconcerting to the modern writer on educational measurement
to read these paragraphs from the pen of Horace Mann
nearly a century ago. This volume by Caldwell and Courtis 
carries an eloquent reprimand to those of us who, in writing 
on topics related to tests and testing, have, through ig- 
norance of earlier thought, imagined that the science of 
educational measurement is wholly novel and quite recent. 
Dr. J. M. Rice has been rather generally credited with being 
the father of educational measurement, but these quotations 
show clearly that certain of the essential ideas of the measure- 
ment of classroom products antedated these commonly 
accepted pioneers: Rice, Thorndike, Courtis, Stone, et al.

The author hastens to warn the reader that these appreciations
of the insight of Horace Mann must not be carried too
far on the wave of enthusiasm. After all, Horace Mann’s
examination (his “idea in mind”) was not the standard
test of today nor even the modern teacher’s concept of an
adequate objective and impartial instrument of evaluation. 
The author is no student of the history of education; re- 
search will in all probability reveal even more remote writers 
who were the source of stimulation to the thinking of Horace 
Mann. Present comments are not directed at the establish- 
ment of the ultimate sources of our ideas on measurement, 
but rather, at the expression of a proper humility and an 
acknowledgement that the fundamental thinking on the 
merits and limitations of written examinations is an older 
story than current textbooks on tests and measurements 
have given us to believe. 

Analysis of the ideas of Horace Mann. The quotations 
from these writings of Horace Mann will repay the reading 


¹ Loc. cit., pp. 40-41.

again. There are certain ideas imbedded there which form
the basis of current practices in educational measurement. 
If the reader will follow the suggestion of re-reading these 
quotations analytically, he will find the following fundamental
propositions boldly stated:

1. Examinations should be written rather than oral.
(This of course refers to those final evaluations, which, in 
the custom of 1845, were delegated to the school committees.) 

2. The questions presented should be uniform for all. 

3. Uniform written examinations are more economical of 
time than are oral, individual examinations. 

4. Uniform written examinations are longer in effect; i. e., 
they “sample more widely” (to phrase the statement in the 
modern terminology).

5. Examination questions differ greatly in difficulty, and 
such differences operate to obscure the real differences in 
ability among pupils. 

6. Oral examinations tend to be unsystematic, and to be 
deflected from the aim of the examiner by unforeseen cir- 
cumstances. 

7. Uniform examinations place all students under the 
same conditions; i. e., they approximate an experimental 
situation, rather than the analogy of Horace Mann where 
twenty men ran a mile race over unequal courses. 

8. Any examination is a limited sampling of a pupil’s 
knowledge and skill; the larger the number of questions the 
fairer the test. 

9. There is a marked chance element in success or failure
on an examination. (The use by Horace Mann of the analogy
of the throwing of dice is a striking forerunner of much recent
controversial literature on such tests as the true-false and 
multiple-choice.) Mann’s statements of the outcomes of 
repeated throws of a die and the long-time stabilization of 
results cannot fail to interest the student of modern literature
on examination methods.

10. Oral examinations are relatively less reliable because
of the greater tendency to emotional disturbances as compared
with written tests.

11. Examinations should be freed from the inadvertent 
and very human tendency of teachers to assist pupils with 
the answering of the questions. 

12. Examinations should be of considerable length. (This 
is implied rather than stated when Mann chose sixty minutes
as the length of an examination in contrasting the scope of 
oral and written examination calling for one hour of time.) 

The writer is somewhat disturbed by the danger of reading 
more into Mann’s statements than the latter implied. Of
this the reader must be the judge; in most instances Horace 
Mann seems to state the points without ambiguity, although 
present knowledge has refined and greatly extended these 
principles. 
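
Propositions 8 and 9 above lend themselves to a small numerical illustration. The sketch below is not taken from Mann or from any study cited in this volume. It assumes, purely for illustration, a pupil who answers each question correctly with a fixed probability (the hypothetical `true_mastery`), independently from one question to the next, and it estimates how widely his score would vary from one test to another as the number of questions grows. The function names and the figure 0.7 are assumptions.

```python
import random
import statistics


def simulate_scores(true_mastery, n_questions, n_trials=2000, seed=0):
    """Simulate repeated tests and return the proportion-correct scores.

    Assumption: each question is answered correctly with probability
    `true_mastery`, independently of the others (a simplification).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        correct = sum(rng.random() < true_mastery for _ in range(n_questions))
        scores.append(correct / n_questions)
    return scores


def score_spread(true_mastery, n_questions, n_trials=2000, seed=0):
    """Standard deviation of the simulated scores: a rough index of the
    chance element in a test of the given length."""
    return statistics.pstdev(
        simulate_scores(true_mastery, n_questions, n_trials, seed)
    )


if __name__ == "__main__":
    # The spread should shrink as the test is lengthened.
    for n in (1, 5, 20, 100):
        print(n, "questions: spread =", round(score_spread(0.7, n), 3))
```

Under these assumptions the single-question test behaves like Mann's first throw of the die, while the longer tests approach the "great equality" he describes for dice thrown all day.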

Regardless of the degree of parallelism of Horace Mann’s
ideas and the recent development of standardized and un- 
standardized objective (or “new-type”) examinations, these 
quotations, comments, and analyses form an excellent start- 
ing-point for the series of discussions which have been 
grouped in this chapter under the heading, “Points of View.” 

THE RETENTION OR ELIMINATION OF EXAMINATIONS 

The justification for examinations. Since there are so
many vociferous critics of examinations (by no means confined
to public-school pupils and rebellious college undergraduates),
it is fair to raise the question whether such
measures are to be justified at all. Certain it is that the 
arguments pro and con rest largely upon opinion. There is 
no convincing evidence that examinations are essential to 
the complete procedure of instruction. There is similarly 
no indisputable evidence that they are not. For this reason
alone, the present treatment of examinations is justified in
avoiding this highly controversial issue. 

In the second place, no one even reasonably well informed 
about recent developments in education can doubt that the 
written examination is taking a new lease on life. A quarter 
of a century and less has witnessed the rise of the standard- 
ized examination, and still more recently the unstandardized, 
informal, teacher-built, objective test.¹

The whole modern emphasis on the psychology of individual
differences and the attendant problems of measurement
is too recent and too firmly intrenched to make probable any
lessening of the esteem in which examinations are held. 
Aside from these newer developments the examination 
system is firmly established, and although it will certainly 
remain under fire from those who look on it with disfavor, 
present indications point strongly to the fact that the main 
effort will be expended in perfecting it along directions 
suggested by recent experimentation. 

In general, it will be more fruitful to attempt to better a 
practice which consensus of opinion still favors before con- 
sidering the more drastic step of complete elimination. The 
burden of reform always falls upon the reformer. Whether 
the written examination holds its place in the curriculum from 
merit or from precedent, those who would do away with it 
must assume the obligation of experimental demonstration 
of the futility of examinations. The critics of examinations 

¹ The type of examination with which this book deals principally is variously termed
“informal,” “new-type,” “unstandardized,” “short-answer,” and “objective.” None of
these names is particularly fortunate since each emphasizes but a single one of several
important attributes of such examinations.

On the whole, the author prefers the term objective examination, believing that it emphasizes
the most important single point of difference between two contrasting types of
examination, i. e., objectivity or freedom from personal opinion in evaluating examination
results.

The term new-type examination has been very largely employed by Dr. Ben D. Wood,
a foremost student of examination methods. This designation unfortunately seems certain
to lose its meaning with the passing of time. (It should be recalled that Horace Mann used
an almost identical terminology in 1845, and already we are proposing a “newer” type
of test.)

The present volume will use several of these designations interchangeably, following
the practice of most writers.

have mostly argued a priori; they have marshaled little or
no experimental evidence, and the present temper of pro- 
fessional education draws it constantly further from con- 
viction by argumentation and constantly nearer to con- 
viction by experimentation. The worth of examinations is 
open to crucial experimental determination, and nothing 
short of this will be convincing in the long run. 

For reasons advanced above, in part, it is justifiable to 
avoid completely the broader question of the retention or 
elimination of examinations, and to study instead the 
possibilities of improving our devices for measurement of 
school accomplishment. 

THE FUNCTIONS SERVED BY EXAMINATIONS 

Classification of the purposes of examinations. Although 
a great many specific functions have been claimed for the 
written examination by one writer or another, these may for 
present purposes be grouped as four, as follows: 

1. Motivation of the learning of pupils. 

2. Maintenance of standards of accomplishment. 

3. Training in the use of the English language. 

4. Measurement of accomplishment. 

Examinations for motivation. It is unfortunate that we 
have so little direct information as to the motivating effect 
of examinations. That examinations do have this value 
has been tacitly agreed but never proved. In spite of this 
dearth of proved fact, it does seem reasonable to suppose 
that pupils strive for somewhat greater and somewhat 
more permanent mastery when they realize that searching 
examinations may be expected at a later date. If this con- 
clusion is true, certain reforms in the examination system 
might greatly increase the value of the examination as a 
motivator. We might argue somewhat as follows:

1. The motivating value of an examination varies with
the esteem in which it is held by pupils. The more impartial 
and objective the examination marks, the more meaning they 
will have for the pupil. 

2. Examinations should come at frequent intervals, and 
should not be confined to end-of-semester and end-of-year 
testing. To examine extensively but infrequently delays the 
day of reckoning so long as to make the goal too remote to 
stimulate the pupil. It also encourages the cramming
attitude, which is of doubtful value. 

3. Where tests and examinations punctuate the teaching
at frequent intervals, it is possible for the pupil to keep 
cumulative, graphic records of his achievement. Such records 
form one of the strongest forms of motivation which we 
know today. Experimental psychology has repeatedly 
demonstrated that output with knowledge of results is 
markedly greater than is the case where the learner is kept 
in ignorance of his successes and failures. 

4. When tests are of a detailed, specific, and diagnostic 
character, pupils cease to regard them as drudgery but come 
to depend upon them for guidance in remedying their weak- 
nesses and as preparation for future opportunities to better 
past records. 

Traditional examinations fostered a not altogether
wholesome attitude on the part of pupils. As goals they
were too remote. They called for an excessive expenditure 
of physical energy in writing. They represented an undue 
balance in the apportionment of time; too much for writing, 
too little for thinking. There was a considerable and growing
distrust of their accuracy in differentiating among the 
abilities of pupils. 

If examinations in the past have failed to serve as moti- 
vators, the fault is possibly due to the kind of examination. 
If the arguments advanced here are sound, it appears hopeful
to attempt to increase the motivation value of the examination
through reforms in the examination system itself.

Maintenance of standards of work. Many school supervisors
feel that examinations and tests set by them offer a
good means of control of standards of work by different 
teachers. This belief has led to the practice of conducting 
uniform city-, county-, and state-wide examinations. Such 
practices appear to be losing ground slowly, although almost 
fifty per cent of the individual states do have uniform state 
examinations in at least the eighth grade. In some states 
the county boards of education are the examining bodies. 
It is in the cities that uniform examinations have lost most
ground, although some cities are returning to something
like uniform examinations, except that modern objective
tests are employed instead of the older essay examinations. 
Standard educational and mental testing has provided a 
more secure basis for rendering instruction sufficiently uni- 
form from one classroom to the next. 

This brings us to a related problem: that of evaluating 
the efficiency of the teacher by tests of her pupils’ accom- 
plishments. It was on this point that Dr. J. M. Rice drew 
so much fire from the National Education Association at the 
beginning of this century. It was first thought that the 
standard test would serve to evaluate teaching upon the 
principle that if pupils showed high accomplishment, the 
teaching was good, but if pupil achievement was low, 
teaching was, perforce, unsatisfactory. 

The dangers inherent in this point of view were soon 
exposed. The standard test method made no allowances for 
differences in pupils’ mental equipment, the most important 
single factor controlling the rate of learning yet found. It 
was soon realized that standard tests were unadaptable to
local conditions, that they were open to abuses through 
coaching, that they were not as unerring guides as first
supposed, and that they often misfired in reaching the
essential activities of the classroom. 

There is a legitimate place in the scheme of supervision for 
uniform examinations. If the general merit of a series of 
uniform examinations could be established, the periodic 
application of such tests would accomplish much by way of 
equalizing instruction and defining objectives throughout 
school systems. Where objectives and aims can be translated 
into concrete test situations, supervision through locally
constructed tests is far more economical than personal 
supervision. Such tests must obviously have a high degree 
of accuracy; they must confine themselves to essential 
outcomes without placing limitations upon the means of
arriving at these outcomes; they must parallel the curricular 
units with exactness; they must be numerous enough to 
cover all major curricular units; and they must be con- 
structed in duplicate and equally difficult forms so that they 
do not need to be repeated sufficiently often to lay them open 
to abuse through coaching and cramming. 

Further discussion of these points will be postponed until
Chapters II and VII. 

Examinations as training in language. One of the 
strongest arguments for the traditional examination has been 
its reputed value in teaching pupils to organize their ideas 
and to place them on paper in good English. If written 
examinations in the past have served well such an obviously
important function, a change to the mechanical objective 
examinations will be a real loss to linguistic training. 

But what are the probable facts? Do the conventional 
examinations contribute importantly to the teaching of the 
English language?

The conviction is slowly gaining ground that the value of 
written examinations in establishing good language habits is 
largely illusory. Since this book advocates replacement of
much of our examination system by the new-type objective
tests (of little or no value in language training), its recom- 
mendations are certain to be opposed by those who think the 
examination supplies useful language drill. In such a case
the best defense is to attack. The arguments against the 
written examination, as contributing significantly to good 
English habits, may be marshaled somewhat as follows: 

1. Most teachers have noted that final examinations show 
a quality of diction, grammar, and spelling markedly inferior 
to the products of the regular English, composition, and
spelling classes. Some teachers are convinced that the 
written examination has a negative or destructive value in 
English training. 

2. The actual conditions of the examination period are 
unfavorable to good linguistic expression. The pupil is 
usually required to write five or ten long questions, very 
often taxing his speed to finish at all. He has little or no 
time to reflect upon his literary style. He will be lucky to 
get down the facts in the allotted time. 

3. The pupil realizes, consciously or unconsciously, that 
the paper will be graded upon facts, not style. He senses 
that his teacher wants to find out what he knows about 
geography, physics, etc., and he acts accordingly. 

4. Language habits are complex. They, like all other
habits, are built up slowly and consciously. They do not 
arise by magic. It is unlikely that they will arise as by- 
products of frenzied efforts at setting down facts in limited
time. 

The reader must judge of the soundness of such arguments.
If true, the ordinary written examination contributes nothing 
to language development, and if, further, the new-type 
examination comes into general use, it will be necessary to 
seek some method of providing for the measurement of the 
power to organize and express ideas on paper. Possible 
solutions of this problem are:

1. Give up all attempts to make the examination period
serve the purposes of language training and make provision 
for such needs elsewhere. 

2. Divide all (or certain) examinations in two parts, some-
what as follows:

Part I: Objective (completion, true-false, etc.) in character; 
suggested number of items, 75 (as a minimum); time 
allowed, 20-30 minutes; total credit allowed, 75 points 
out of a total of 100. 

Part II: Discussional (essay-type) in character; suggested 
number of questions, 1 (or at most 2); time allowed, 
30 minutes (as a minimum); total credit allowed, 25 
points out of a total of 100. 

The division of credit as 75:25 is based upon the logic 
that a composite score on the two parts should not be 
rendered too inaccurate by allowing too much weight to the 
second part of the examination (which is not open to ac- 
curate marking). It should also be noted that 30 minutes 
per question is suggested as the time allowed under Part II.
The reason for this will soon be made apparent. 
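
The arithmetic behind the 75:25 division can be sketched briefly. The Python below is a minimal illustration only: the function name `composite_score`, its parameters, and the sample figures are hypothetical and do not come from the text. It assumes that Part I is scored as the number of items right, scaled to 75 points, and that Part II receives a teacher's mark between 0 and 25.

```python
def composite_score(part1_right, part2_mark,
                    part1_items=75, part1_points=75, part2_points=25):
    """Combine an objective Part I and an essay Part II into one mark out of 100.

    Assumed scoring (hypothetical): Part I is the number right, scaled to
    part1_points; Part II is a subjective mark from 0 to part2_points.
    """
    if not 0 <= part1_right <= part1_items:
        raise ValueError("part1_right must lie between 0 and part1_items")
    if not 0 <= part2_mark <= part2_points:
        raise ValueError("part2_mark must lie between 0 and part2_points")
    part1_score = part1_points * part1_right / part1_items
    return part1_score + part2_mark


# Two teachers who disagree by 10 points on the essay move the
# composite by only 10 points out of 100.
lenient = composite_score(part1_right=60, part2_mark=22)
severe = composite_score(part1_right=60, part2_mark=12)
spread = lenient - severe
```

Because Part II carries only 25 of the 100 points, no disagreement between markers on the essay can shift the composite by more than 25 points.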

Part II should be accompanied by instructions to the
pupil substantially as follows: 

Instructions: 1. In this part of the examination you are not to be graded 
upon the number or accuracy of the facts which you put down. 

2. Your mark will be based upon the following factors: (a) evidence of 
thought, (b) good sentence and paragraph structure, (c) grammatical and 
dictional errors, (d) spelling, etc. In other words, consider that you are 
writing a theme or composition for an English class. 

3. Read the question (or questions) at least twice before you start to
work.

4. Do not attempt to write anything until you have made a written
outline of what you are going to say. 

5. As you make your outline, think through what you intend to write 
as your answer to the question. 

6. Do not hurry! Spend at least half of the time allowed in thinking and 
planning your outline. 

7. Remember: You are not to be graded upon the facts about geography
(or subject in question). Your mark will be based upon your ability to 
express your thoughts in correct English. 

Now it follows that a teacher electing to conduct such an 
examination must “play the game fairly” with her pupils. 
The pupil must not feel any pressure of time. The teacher 
must steel herself to avoid, as far as possible, marking the 
replies upon a basis of subject-matter, and she must take 
her time in evaluating such a paper. The marking will be 
highly subjective at best. 

The author, although perfectly serious in the foregoing 
proposal, does not believe that many teachers will adopt the 
type of examination suggested for Part II. The honest-
minded ones will see that our proposal is consistent with an
unprejudiced effort to make the written examination function 
in language training. The majority of us will be more likely 
to conclude that “If I have to go to all this bother, I’ll find 
some other way of teaching English. After all, examinations 
are intended to measure pupils’ abilities, not to teach them 
the English language.” We may not condone such an attitude,
but is it not the expectancy?

Examinations as measurements. The measurement of
achievement has been admittedly the principal reason for 
examinations. This idea is undoubtedly sound. It may 
require certain qualification, but all seem to be agreed that 
the first purpose of a test or examination is that of ascertain- 
ing the degree to which individual pupils have profited by 
instruction. 

Although we have referred to measurement as a single idea 
or term, it really includes a number of fairly discrete pur-
poses, viz.:

1. Measurement of general or all-round ability in a school 
subject. These are usually comprehensive final (semester
or year) examinations designed to show the pupil’s general 
grasp of the subject and to stimulate him to review. Such
tests or examinations should represent the “high points” and 
should help to leave the pupil with a bird’s-eye view of the 
subject. 

2. Diagnosis of specific strengths and weaknesses in in- 
struction and the profit from instruction. Such tests are 
usually given at the time of completion of each important 
teaching unit. They are detailed, and they must be very
reliable if the diagnosis is to have meaning for individual 
pupils. Results from such diagnostic tests should lead to
two general outcomes: 

(a) Improvement of the instruction of the teacher.

(b) Guides to remedial or corrective work for the pupil.

3. Examinations for prognosis, placement, sectioning, etc. 
Although standard educational and mental tests are more 
often used for this purpose, informal objective tests have 
great possibilities here. There are dozens of questions to be 
answered in handling pupils in a modern school, e.g., ad-
mission to high school or college, placement in proper grades
of transfers, sectioning into ability groups, educational and 
vocational guidance, prediction of future success, etc.¹

The question of when a pupil has been measured accurately 
is a principal theme of this volume. Several chapters further 
on we shall be in a better position to discuss examinations 
in the light of the theory and practice of measurement. 
There it will be shown that measurement is never complete, 
but merely samplings of ability. “Old” and “new” types of 
tests will be studied in the light of the criteria which good 
examinations must meet. There is a new vocabulary, that 
of the professional student of examinations, which must be 
assimilated and absorbed into our thinking before we can 
hope to approach the ins and outs of various types of measurements
in a critical and analytical fashion. For these
reasons further discussion of this point is unwarranted at
this time.

¹See P. M. Symonds, Measurement in Secondary Education (New York: The Macmillan
Company, 1927), pp. 1-2.

G. M. Ruch and G. D. Stoddard, Tests and Measurements in High School Instruction
(Yonkers-on-Hudson, New York: World Book Company, 1927), pp. 8-44.

THE PRINCIPAL KINDS OF EXAMINATIONS

A classification of examination methods. Four types of 
measurements exist side by side in the modern school.
These are: 

1. Oral questioning 

2. The traditional examination¹

3. The standard test 

4. The objective or new-type examination 

There are other means of evaluating school results, but the 
four types mentioned are the most important.

With such a variety of methods open to the educator, and
with so little of a final character known about their relative 
merits, the only course open to us is to consider both the 
logic and the growing body of experimental findings sup- 
porting or undermining the value of each. This is the task 
of this volume in its entirety, but a few brief comments will 
help to give a point of view. 

The oral examination. Strictly speaking, oral questioning
does not usually constitute an examination. Oral examina- 
tions are sometimes employed, but, with Horace Mann, we 
doubt their value for the more serious and final determina- 
tions of achievement. This is no argument against oral 

¹The rise of objective or new-type examinations makes necessary a distinction between
the long-established form of test and the more recent and more objective type of examina-
tion. The former has come to be known as the traditional examination or the essay examina-
tion. The traditional examination needs no definition. It is the examination which we
recognize as consisting of five, ten, or more questions, beginning most often with “State
in full,” “Describe,” “Tell what you know,” etc. The pupil is free to write what he chooses
as a response to the stimulus question. It is to be contrasted in its mechanics with the
newer objective examination in that the latter calls for underlining, crossing-out, eliding,
etc., instead of discussion. The traditional examination cannot be scored mechanically by
keys or stencils but must be evaluated subjectively by competent persons.





questioning. In many ways the teacher’s daily questioning
of her pupils is of far more fundamental importance than her 
final written examination. The point is that oral questioning 
is more logically a part of initial instruction than of final 
measurement, assuming that there are at least five roughly 
distinguishable phases to the complete act of instruction, 
as follows: 

1. Initial presentation of materials to be mastered. This 
phase consists of setting problems to be solved, textbook 
readings and discussions, teachers’ comments on persistent 
difficulties in learning, etc. 

2. Drill to support and fix the temporary mastery gained 
under the first phase of instruction. This may be drill
proper or it may mean applications and reviews. 

3. Diagnostic measurement at the period when phases one
and two are thought to be complete. 

4. Re-teaching or remedial instruction upon any weak- 
nesses revealed under the third phase. 

5. Final measurement and evaluation of a more general 
and less detailed character than that of phase three. This 
constitutes the final survey of achievement and leads to a
judgment as to whether the individual or class is ready to 
proceed to new work. 

It should be noted that certain of these phases are less 
prominent than others at times, the relative emphasis vary- 
ing with the character of pupil, teacher, textbook, subject, 
motivation, etc. 

Oral questioning plays its greatest role in the first, second, 
and fourth phases of instruction as presented above. It is 
primarily instructional; its value for measurement is more 
subordinate. Oral questioning as an art has a long history 
and a considerable literature. It is worthy of more experi- 
mental study than it has received to date. 

The traditional written examination. The familiar dis-
cussion- or essay-type examination has long been the
principal reliance of the teacher in evaluating pupil ac-
complishment. It is probably the most frequently employed
examination at present, although this dominance shows signs 
of breaking. Its advantages cannot be stated more clearly 
than our previous quotations penned by Horace Mann 
almost a century ago. Its weaknesses are numerous, but
this is not the place for the discussion of such limitations. 

It is sufficient to point out that the essay examination 
suffers from one major defect not inherent in the standard
test or the newer objective examination; viz., experience and
experiment have shown that the results of an essay examination 
cannot be evaluated fairly by human minds. Its inaccuracies 
are those of the human mind and the human prejudice. 
Such examinations seemingly cannot be freed from the 
personal equation. 

Our chief interest in the present chapter is in broader 
points of view about examinations. The mark on the con- 
ventional examination cannot fail therefore to interest us 
as students of examinations. To the degree that an ex- 
amination mark or grade reflects the knowledge, attitudes, 
and prejudices of the marker of that examination paper, the 
examination is not a true measurement since all are surely 
agreed that it is the accomplishment of the pupil which is 
to be measured. If, as we shall see later, the same pupil’s 
paper is graded all the way from 40 to 90 (as many investi- 
gators have found), there is but one conclusion to be drawn; 
viz., the pupil has not been measured. To be at the same 
time a “40” pupil (a dunce) and a “90” pupil (a candidate 
for the class valedictorian) is not only unthinkable but 
palpably untrue! Such a finding raises the suspicion that
he is neither, a conclusion that can well be supported on the 
ordinary logic underlying our basic theorems of probability. 

To be taken at face value, any examination result must
meet many stringent criteria, and one of these is that it be a 
measure of the pupil— not the teacher, not his class, and 
not the school system. Yet it must be admitted by any fair- 
minded student of the literature that the traditional examina- 
tion is prone to tell us as much, or almost as much, about 
whom the pupil had for a teacher as it does about the edu- 
cational equipment of the pupil himself. We shall not trouble 
to prove this assertion at this time. 

The technical term for this weakness in the common 
essay-type examination is, in our modern educational 
terminology, subjectivity of marking. It was as a relief from 
this admitted weakness that the standard test and the 
objective examination were introduced. How adequate the 
remedy will prove to be cannot be foretold here, although we
may study the evidence, combine this evidence with our 
logical and experimental deductions, and finally arrive at a 
tentative point of view. This will have to be the task of 
succeeding chapters. 

The standard test. Standardized examinations have just 
completed the first quarter century of their existence. From 
a few pioneer attempts by Rice, Thorndike, Stone, Courtis, 
and others in the fields of spelling, arithmetic, and reading, 
the movement has grown until conservative estimates place 
the total number of available tests and scales at at least five 
hundred; there are probably considerably more. It is im- 
possible to secure even approximate estimates of the numbers 
of standard tests administered annually. There are several 
educational tests whose sales have passed the million mark
annually. In one or two cases, two million is a more nearly 
correct figure. The total number of standard tests sold dur- 
ing the past year is probably at least twenty million, possibly 
somewhat more. 

These figures, estimates as they are, point to the import-
ance of standard tests as measures of the results of teaching.
It seems certain that the curve of the use of standard tests 
is rising more rapidly than is that of the increase in school 
population. 

The standard test was introduced to serve several pur- 
poses. These are not in all respects a prime concern of the 
present treatment, but they serve to orient our thinking 
about tests and examinations in general. The principal 
aims of the standard test may be listed as follows: 

1. They (as the name implies) represent an attempt to 
control or standardize the conditions of the examination 
period with respect to directions, time allowances, method of 
responding, etc. 

2. They are objective or impartial; i.e., the personal
equation of the examiner is minimized or eliminated: minimized
in the administration, and eliminated almost or quite
completely in the scoring of the examination. 

3. They provide norms or standards (as the name further 
implies) by which the scores of individual pupils may be 
evaluated and interpreted in the light of facts. Such facts 
are the performances of large numbers of supposedly typical
pupils on the same tasks. 

These aims can all be attained to degrees commensurate 
with the practical needs of education, the third aim being 
the most difficult, and, on the whole, decidedly the least 
important. 

Against these advantages of the standard test may be set 
certain more or less fundamental limitations which are 
briefly enumerated below. 

1. Standard tests are inflexible and cannot be closely 
adapted to the idiosyncrasies of local school conditions. 
They are of necessity general enough to meet moderately
well a wide variety of curricula. 

2. In view of the foregoing, they need constant supple- 
mentation in a complete measurement program. 

3. Standard tests are somewhat expensive. The range of 
prices varies from about one cent per pupil to at least ten 
cents per pupil. This, of course, is a practical limitation, 
not a theoretical one. It should also be noted that there is 
considerable correlation between cost and worth. As is the 
case with all commercial products, tests are sold in a com- 
petitive market, and costs are reckoned upon the basis of 
the expenses of production. 

4. The majority of standard tests are of little value. A 
large number are nothing more than “examinations with 
norms,” produced by persons without special training or 
knowledge of test construction. If one hundred of the best
were selected and the rest destroyed, the loss would be negli- 
gible. 

Only the first-mentioned of these limitations of the stand- 
ard test is serious. The others may be overcome by careful 
selection, by the planning of measurement programs, and by 
efficient school budgeting. It would appear to be impossible 
to adopt the standard test as the sole element in a measure- 
ment program. It might well repay the cost, but it is to be 
doubted whether local needs could ever be met satisfactorily. 
Both the traditional and the new-type examination are free 
from this limitation of non-adaptability to local school
curricula. 

The objective test or examination. For the present we 
will define objective tests by means of a few test items illus- 
trating several of the principal types. 

1. (Simple recall) The Senate and the House of Representatives to-
gether form the United States __________.

2. (Completion) Oxygen is often prepared by heating potassium
chlorate together with __________, which acts as an accelera-
tor of the reaction. Such substances are called __________.
The formula for potassium chlorate is __________. Water
may be decomposed by the electric current to oxygen and
__________ in the ratio, by volume, of ______ to ______.

3. (True-false) The rainfall is generally heavier on the eastern than
on the western slope of north-and-south ranges in the path of the westerly
winds. TRUE FALSE

4. (Multiple-choice) A reduction in price for buying in large quantities
is called a   commission   discount   dividend   mortgage   revenue

The objective or new-type test is essentially a hybrid. 
It represents the objectivity of the standard test without the 
refinements of experimental study and standardization. 
Because careful standardization is not attempted, it may be 
produced almost as cheaply as the traditional examination, 
although at a much greater expenditure of time and energy. 
It (like the traditional examination) has a high degree of 
adaptability to local conditions. The lack of norms is a 
limitation, but no more serious a one than is the case with 
the older forms of examinations. The relatively smaller 
degree of refinement, as compared with well-made standard 
tests, can be compensated for very largely by increased 
length. This point will be discussed at length in later 
chapters. 

No data are available as to the extent to which objective, 
teacher-made tests are being utilized in our schools. There 
can be no doubt that these types of measurement are in- 
creasing in favor even more rapidly than are standard tests. 
Hailed a half-dozen years ago as “a new type of examina-
tion,” they are regular routine in thousands of progressive
schools. They show signs of their inroads into the practices 
of even the most conservative examination bodies. The 
New York Regents are conducting extensive experiments 
with these new tests. At least two of the states giving uni-
form state examinations are using objective tests wholly
or in part.¹ There are doubtless others which have not come
to the attention of the author. Certain school systems like 
Atlanta, Denver, Detroit, Los Angeles, Rochester, St. Louis, 
and many others are developing extensive batteries of such 
tests. In some states where uniform examinations are set by 
county boards of education, certain counties are now de- 
veloping series of objective tests for the elementary school 
subjects.² No special attempt has been made to collect
adequate statistics on the use of objective tests throughout 
the United States. The specific references are those which 
come to the author’s mind at the moment of writing. In a 
recent competition for the construction of objective tests, 
about 400 examinations were submitted for consideration 
for the prizes offered.³ Chapter VIII presents certain sum-
maries of the findings from this competition.

Looking back over the five-year period in which the author 
has been engaged in studying experimentally one phase or 
another of examination construction, the story of the progress 
of objective examination methods is ample grounds for 
predicting that something like the objective examination, 
when perfected, will be the principal reliance of the class- 
room teacher for the next few decades to come. There will 
doubtless be a place for the traditional examination in the 
future, but it seems likely that it will tend to become a last 
resort, to be employed when other methods are not at hand. 
It may be possible to perfect the ordinary examination so 
as to control some of its vagaries, but progress to date 
leaves small reason to hope for marked success.⁴ As has

¹New Jersey and Wyoming.

²E.g., Lewis County, New York, and Kern County, California.

³The results of this competition are published in G. M. Ruch and G. A. Rice, Specimen
Objective Examinations (Chicago: Scott, Foresman and Company, 1929). This volume
presents thirty-five of the best examinations in a total of nearly four hundred submitted.
The prize-winning examinations will be found to be both interesting and valuable to the
teacher just beginning to employ objective examination methods.

⁴See Chapter III, pp. 101-106.
been said, the old- and the new-type must be regarded
as complementary and not antagonistic; the latter type of 
examination will exclude the former more or less completely 
for informational subjects, and the former will doubtless 
continue to hold a place in the measurement of expressional 
and appreciational subjects. 



CHAPTER II


THE CRITERIA OF A GOOD TEST 
OR EXAMINATION 

STATEMENT OF THE CRITERIA

The “ear marks” of a good examination. The criteria for 
use in the construction of standard and unstandardized tests 
naturally differ somewhat. The following have been selected 
as sufficient for the main outline in understanding the theory 
and construction of school examinations of the objective 
type. We may therefore proceed to outline the principal 
criteria of a good test or examination as follows: 

I. Validity 
II. Reliability 
    A) Objectivity
    B) Extensity or adequacy of sampling
III. Ease of administration and scoring
IV. Norms or standards for evaluation of results
V. Availability of equivalent or duplicate forms

VALIDITY

Definition of validity. The most important single fact
which can be known about a test or examination is the degree 
of validity which it possesses. Validity may be defined 
variously, these separate definitions constituting, collectively,
the ideas incorporated in the term. 

1. Validity is the degree to which a test or examination 
measures what it is intended to measure. 


2. Validity is the general worthwhileness of an examina- 
tion. 

3. Validity refers to the care taken to incorporate in a 
test or examination those elements or items which are of 
prime importance, and to the pains taken to eliminate the 
non-essential. 

4. Validity is in general the degree to which a test parallels 
the curriculum and good teaching practice. 

5. Validity refers to the value of the test for measuring 
specific abilities in an accurate fashion, and a test ceases to 
have validity when applied to the measurement of abilities 
for which it was not intended. 

To these we must add, for the sake of avoiding misunder- 
standing, and not exactly by way of definition, the fact that 
validity is a broader term than reliability (defined later), 
and that validity includes reliability. That is, a valid test 
must of necessity be a reliable test.

The nearest synonyms for validity are “goodness,” “gen- 
eral merit,” and “worthwhileness.” 

It will add still further to our concept of validity to review 
the means by which validity is guaranteed in a test. Some 
of these means must perforce be reserved for Part II of this 
volume, “How to Construct an Objective Examination.”
The methods of validating standard tests are very instruc- 
tive, and, although not entirely applicable to the purposes 
of objective test construction, they will help to define our 
terms at this time. 

Principal methods of validating tests. These may be 
stated as follows:¹

¹The following outline is taken with changes and additions from G. M. Ruch, The
Improvement of the Written Examination (Chicago: Scott, Foresman and Co., 1924),
pp. 14-16.

For fuller and more technical accounts (with special reference to standard tests), see:

G. M. Ruch and G. D. Stoddard, Tests and Measurements in High School Instruction
(Yonkers-on-Hudson, New York: World Book Company, 1927), pp. 301-328, or

W. S. Monroe, The Theory of Educational Measurements (Boston: Houghton Mifflin
Co., 1923), pp. 56-105.

1. By judgments of competent persons.

2. By analysis of courses of study or textbooks. 

3. By harmonizing with the recommendations of national 
educational committees or other recognized bodies on cur-
ricula, courses of study, minimum essentials, etc.

4. By experimental studies of social utility (such as the 
Horn and Thorndike studies of the most frequently used 
words, the Ashbaugh and Horn studies of spelling lists, the 
studies of Wilson, Woody, et al., on the arithmetic needs of
business, etc.) 

5. By studies of the most frequently recurring errors. 

6. By the computation of the percentages of pupils answer- 
ing each item correctly at each successive age or grade level. 

7. By correlation against an outside criterion. 

8. By combinations of the above methods. 

Inspection of the foregoing list will show that the first
three reduce to the single criterion of expert opinion. Text-
books, courses of study, and national committee reports are
not usually to be regarded as resting upon experimental 
bases. Criteria four to seven are experimental in character, 
but these methods tend toward greater refinement than is 
usually possible or necessary in building informal classroom 
tests of the objective type. 

Methods 3, 4, and 5 influence examination construction 
indirectly; i. e., curricula should embody all such empirical 
data, and the content of tests and examinations should follow 
closely the curricula. The Yearbooks of the National Society 
for the Study of Education and the Department of Super-
intendence of the National Education Association have
placed these and related studies at the disposal of teachers. 

Methods 6 and 7 belong in the field of standard test con- 
struction. Such techniques cannot be applied directly in 
building informal tests. The teacher who wishes real insight
into such factors as validity and reliability will be repaid for
some study of validation methods as applied in the con-
struction of our best mental and educational tests.¹

The teacher’s examinations will ordinarily be validated by 
a combination of three criteria: (1) the local course of study,
(2) analysis of the textbooks employed, and (3) her judgment 
on points of emphasis, inclusion, and exclusion. The guiding 
principle in validation should be: the tests must parallel the 
actual teaching. We may emphasize this point still further, 
and somewhat differently, by saying: any test must represent 
an extensive sampling of the materials of instruction.

Suggestions for validating tests. Part II of this volume 
will set up a practicable procedure for the actual construc- 
tion of an objective examination. In particular, a plan for 
making a “Table of Specifications” will be presented. It 
is not the purpose to enter upon such a discussion here, 
although brief mention of the value of such a table or outline 
will help to clarify further our concept of validation. The 
following suggestions will aid the conscientious teacher in 
constructing good (valid) tests. 

1. In the course of regular teaching, make a practice of 
jotting down good test items (questions) as they occur to you. 
This will save “racking your brains” when you come to 
examination time. 

2. Place these test items on small bits of paper; 3x5 
library cards are best. Make a file of these questions. 
Secure a filing case and keep these cards. This filing system 
allows the insertion of new items, the discarding of unsuitable 
ones, and easy sorting and arranging of test materials. 
When test items are written or typed consecutively on paper, 
revisions and alterations often necessitate laborious re- 
copying. A card may be inserted or thrown away without 
entailing any other alterations. 


¹See Ruch and Stoddard, op. cit., Part IV, pp. 301-375, or Monroe, op. cit., pp. 56-105.

3. When the time comes to build an examination, draw
up a Table of Specifications, as directed in Chapter VII.
This will tend to guarantee a defensible balance of emphasis,
freedom from non-essentials, and the inclusion of all im- 
portant topics. 

4. After the test is given, ask the pupils to suggest items 
that were ambiguous, misleading, or not understood. You
will find that from 5 to 10 per cent or more of the items were 
not well worded. These must be revised or thrown away. 

5. Where possible, try to have one or two other teachers 
criticize your test items and rate them for difficulty. 

6. The validity of a test is raised by having the items of a 
proper degree of difficulty. Items passed by every child or 
failed by all contribute nothing to the test. The average of 
two or three teachers’ judgments of difficulty is better than 
the judgment of one, and often is a close approximation 
to the truth. 

7. The validity of a test is increased by having the easiest 
items first and the hardest ones last. 

These suggestions can be carried out in practice without 
great expenditure of time or energy. They will go far toward
guaranteeing a high degree of validity to a test or examina-
tion. There are more exact methods of validation which
may be employed if still more accurate and worth-while 
tests are desired. These are described in the following 
section. 

The validation of individual test items. In validating 
standard tests, validity and reliability of individual test 
items are experimentally determined by giving the prelimi- 
nary tests to hundreds of pupils in different school grades. 
The percentage of pupils passing each test item is then comput- 
ed. If the percentage of successes on each item rises sharply 
and uniformly from one grade to the next, the item is held
to be valid and reliable because it discriminates between
different levels of ability. If the percentages of successes 
rise and then fall (and perhaps rise again), the item is thrown 
away because it is erratic in its behavior. 
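
The grade-by-grade check just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of any particular test maker: the function names, the sample percentages, and especially the `min_gain` threshold are hypothetical, since the text speaks of a "sharp and uniform" rise without giving a figure.

```python
def percent_passing(responses):
    """Percentage of pupils passing an item; responses is a list of True/False."""
    if not responses:
        raise ValueError("at least one response is needed")
    return 100.0 * sum(responses) / len(responses)


def rises_steadily(percents_by_grade, min_gain=5.0):
    """True when each grade passes the item more often than the grade below
    by at least min_gain percentage points.

    min_gain is an assumed stand-in for "sharply"; the text gives no
    numerical threshold.
    """
    if len(percents_by_grade) < 2:
        raise ValueError("at least two grade levels are needed")
    pairs = zip(percents_by_grade, percents_by_grade[1:])
    return all(upper - lower >= min_gain for lower, upper in pairs)


# Hypothetical percentages passing in four successive grades.
steady_item = [20.0, 45.0, 70.0, 90.0]   # rises at every grade: retained
erratic_item = [20.0, 55.0, 40.0, 60.0]  # rises, falls, rises again: discarded
keep_steady = rises_steadily(steady_item)
keep_erratic = rises_steadily(erratic_item)
```

A stricter or looser threshold changes which borderline items survive, so the value chosen should be treated as a matter of judgment.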

The following principles govern the final selection (valida- 
tion) of test items. Certain of these principles apply strictly 
speaking only to standard test construction, although the 
teacher interested in making informal objective tests may 
occasionally wish to employ these to better her tests. In 
the main, the value of these principles will be in the nature
of the further definition of validity and test validation. 

1. Faulty wording (poor sentence structure, ambiguity of 
meaning, etc.) lessens the validity of a test item. 

2. Test items which can be answered by none of the pupils 
(100 per cent failures) are valueless, and hence invalid. 

3. Test items which are answered by all pupils (100 per 
cent successes) are valueless, and hence invalid. 

4. “Tricky” or “catchy” questions are usually unsatis- 
factory. 

5. A few very easy questions, even if passed by all, are 
justified at the beginning of the test for the sake of encourage- 
ment and motivation of the pupils, although such questions 
may not measure, i. e., discriminate differences in ability. 

6. Full and simple directions, the generous use of samples, 
and fore-exercises (preliminary practice questions, not 
counted in the score) increase the validity of the test items. 

7. Arrangement of the test items from easy to difficult 
increases both the validity and reliability of the test. The 
easiest item should come first and the most difficult last, 
with all intervening items arranged in order of increasing 
difficulty. 

The first of these principles needs no comment except to 
emphasize that teachers must be on their guard to prevent 
loose and ambiguous statements of test items from ruining 
otherwise valid materials. Weidemann¹ found violations of 
forty-six rules of grammar, punctuation, diction, spelling, 
etc., in a study of a number of true-false examinations. 

The second and third principles together are intended to 
guard against the inclusion of worthless materials upon a 
basis of too great ease or difficulty. Consider three tests as 
follows: 

Test A: 100 items, each answered correctly by every pupil. 

Test B: 100 items, each failed by every pupil. 

Test C: 100 items graduated in difficulty from items 
passed by all to items failed by all. 

Every pupil would receive 100 on Test A. Every pupil 
would receive 0 on Test B. On Test C on the other hand, 
the scores of a class would vary from something above 0 
to almost 100. Good pupils would make high scores and 
poor pupils low scores. We would say that Test C measures 
achievement but that Tests A and B are worthless (invalid) 
as measures. 

It might be answered that Test A is too easy and Test B 
is too difficult. Exactly! But this condition is one kind of 
invalidity. The expert in measurement would say that such 
tests do not discriminate differences in ability. This gives us 
another point of view of the nature of validity, viz., that 
valid tests arrange (or rank) pupils on a scale of ability. 
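
A small sketch may make the point concrete. The class of five pupils and their scores on Test C are invented for illustration; only the scores on Tests A and B follow from the description above. The range of scores is used here as the simplest sign that a test separates pupils at all.

```python
def score_range(scores):
    """Spread between the highest and lowest score in a class."""
    return max(scores) - min(scores)

# Five hypothetical pupils, listed from weakest to strongest.
test_a = [100, 100, 100, 100, 100]  # every item passed by every pupil
test_b = [0, 0, 0, 0, 0]            # every item failed by every pupil
test_c = [23, 41, 58, 72, 94]       # invented scores on the graduated test

for name, scores in [("A", test_a), ("B", test_b), ("C", test_c)]:
    spread = score_range(scores)
    ranks_pupils = spread > 0
    print(f"Test {name}: range = {spread}, ranks pupils: {ranks_pupils}")
```

With a range of zero, Tests A and B cannot place any pupil above another, whatever the real differences in ability.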

Granting all this, how is a teacher to eliminate items that 
are functionless because they are passed or failed by all? 
There are two ways to do this: (a) by her judgment, a rough- 
and-ready but far from valueless method; and (b) by giving 
the test, and by means of tabulations determining which 
items are too hard or too easy to function. Keep in mind, 
however, that a few very easy items (passed by 95% to 
100%) are desirable for the sake of encouragement, and that 
a few very difficult items (passed by 0% to 5%) should be 
placed at the end of the test to give it “top,” or sufficient 
difficulty to prevent perfect scores (non-measurement). 
Ordinarily, judgment plus later inspection of the test papers 
will suffice to eliminate most of the functionless material. 

¹C. C. Weidemann, “How to Construct the True-False Examination,” Teachers College 
Contributions to Education, No. 225 (N. Y.: Columbia University, 1926). 

The last one of the list of principles for assuring validity 
calls for some comment. In making a standard test elaborate 
experimental try-outs are made of the test items in order to 
determine their degrees of difficulty. The tests are given, 
scored, and the number of successes (or failures) is tabulated 
for each item. The easiest one is then placed first, followed 
by the next easiest, and so on until the most difficult is 
placed last. At the same time any excess of items passed by 
all or failed by all is corrected as previously described. 
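
The tabulate-and-sort procedure of the preceding paragraph might be sketched as follows. The try-out data (four items, five pupils) and the function name are hypothetical; difficulty is taken as the per cent of failures, as in the text.

```python
def order_by_difficulty(responses):
    """responses: dict mapping item label -> list of 1 (pass) / 0 (fail),
    one entry per pupil. Returns (ordered_labels, percent_failing)."""
    percent_failing = {
        label: 100 * marks.count(0) / len(marks)
        for label, marks in responses.items()
    }
    # Easiest item (fewest failures) first; ties keep their original order.
    ordered = sorted(percent_failing, key=percent_failing.get)
    return ordered, percent_failing

# Hypothetical try-out: four items, five pupils.
tryout = {
    "a": [1, 0, 0, 1, 0],   # 60% fail
    "b": [1, 1, 1, 1, 0],   # 20% fail
    "c": [0, 0, 0, 0, 1],   # 80% fail
    "d": [1, 1, 0, 1, 1],   # 20% fail
}
ordered, failing = order_by_difficulty(tryout)
print(ordered)
```

With only five pupils the per cents are very coarse; a real try-out would rest on far larger numbers.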

Assume that we have 100 items of varying difficulty with 
which to make a test. Suppose that we make up this test in 
two editions, as follows: 

Edition I: Items arranged in exact order of difficulty, the 
easiest ones first; then gradually increasing in difficulty until 
the last item is the most difficult. 

Edition II: Items arranged in the order determined by 
placing them all in a hat and drawing them one at a time 
until all are drawn (strictly chance order). 

Figure 1 illustrates graphically the two editions of this 
hypothetical test. Each vertical line represents a test item. 
The height of the vertical bars indicates the difficulty of 
the items (in terms of per cents of failures). A short line 
means an easy item; a long line a very difficult one. 

Distribute Editions I and II in chance order to a class of 
pupils so that half receive I and half receive II. Now it 
might readily happen that a very difficult item (e. g., Item 
No. 4) occurs as one of the very first items in Edition II. 
Those pupils who received Edition II would “hang up” on 
this hard item; i. e., they would waste time on it, and they 
might fail to answer it even after many minutes of thought. 
Had they skipped it or had that item been at the end of the 
test where it belongs, they might have answered a dozen 
easier items which were further along in the test while they 
were puzzling over this one difficult point. Progress through 
such a test would be by jerks, a run of easy items answered 
in a few seconds, then perhaps minutes of delay, then more 
rapid progress, a second long delay, and so on. 

Fig. 1. Illustrating the arrangement of test items in increasing order 
of difficulty (Edition I) and in chance order (Edition II). The length of 
the vertical bars indicates the degree of difficulty. 

The pupils receiving Edition I would move along regularly 
through the easy items at the beginning of the test. The 
items would become gradually and almost imperceptibly 
more difficult. They would finally work, with few delays, 
into the level of the test where they would be stopped by the 
limit of their abilities. The time allowed would be better 
distributed because they met the easy items first, answered 
these rapidly, and had most of their time left to think about 
the puzzling and difficult items. 

This rough picture of the differences brought about by 
good and bad arrangements of test items shows one of the 
reasons why well-made standard tests are superior to most 
other forms of examinations. 

It is not intended that teachers should gain the impression 
that all of their tests should be subjected to the laborious 
and expensive experimentation necessary to the correct 
arrangement of items in order of difficulty. Present dis- 
cussions are directed at laying the basis for real insight into 
and understanding of test validation methods. The arrange- 
ment of test items in order of difficulty can be done reason- 
ably well by the judgment of teachers. If two or more 
teachers can co-operate in obtaining such judgments, the 
average rating for difficulty is almost certain to be con- 
siderably better than a single teacher’s judgment.¹ 

¹In long examinations test items are often grouped by topics, each major division of 
subject-matter constituting a separate part of the test. This is often done to facilitate diag- 
nosis. In such an event the arrangement of items from easy to difficult may be carried 
out within each part. 

In this connection the author feels it necessary to remind the test maker that if such 
separate parts are to yield diagnostic values, each part must be made long enough to form 
a highly reliable test standing alone. In other words, if a test is to yield four separate marks 
or scores on a given pupil, then each part must be made as reliable as would ordinarily be 
necessary for an entire examination. 

An experimental method of validating individual test 
items. In case a teacher wishes to validate test items with 
a greater degree of refinement than the method of individual 
judgments or average of several judgments, there is a reason- 
ably simple method which is quite serviceable. The steps 
are as follows: 

1. Make up the test items, arranging them by inspection 
in order of difficulty. 

2. Give the test to the class, allowing time for all to at- 
tempt every item. 

3. Score the test. 

4. Arrange the papers in order of size of scores. 

5. Find the median mark and separate the papers into two 
classes: (a) those above the median, and (b) those below 
the median. Call the first group the “good” pupils and the 
second group the “poor” pupils. 

6. Tabulate the number of pupils passing (or failing) each 
individual test item, keeping separate tabulations for the 
“good” and “poor” groups. Express the passes (or failures) 
in per cents. 

7. Study the per cents for “good” and “poor” groups. 
Reject items where the “poor” group shows percentages of 
successes as high as or higher than the “good” group. 
Such items do not differentiate abilities. The best items 
will show the largest differences in successes in favor of the 
“good” group. 
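
The seven steps above can be sketched in Python as follows. The six papers are invented, the function name is illustrative, and two details are assumptions of this sketch rather than rules from the text: papers falling exactly on the median are left out of both groups, and an item is kept only when the "good" group does strictly better than the "poor" group.

```python
from statistics import median

def validate_items(papers):
    """papers: one list per pupil of 1 (pass) / 0 (fail), one entry per item.
    Returns a list of (item_number, pct_good, pct_poor, keep) tuples.
    Assumes at least one paper lies above and one below the median."""
    totals = [sum(paper) for paper in papers]           # step 3: score
    mid = median(totals)                                # step 5: median mark
    good = [p for p, t in zip(papers, totals) if t > mid]
    poor = [p for p, t in zip(papers, totals) if t < mid]
    report = []
    for i in range(len(papers[0])):                     # step 6: tabulate
        pct_good = 100 * sum(p[i] for p in good) / len(good)
        pct_poor = 100 * sum(p[i] for p in poor) / len(poor)
        keep = pct_good > pct_poor                      # step 7: reject or keep
        report.append((i + 1, pct_good, pct_poor, keep))
    return report

# Hypothetical class of six pupils answering five items.
papers = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
report = validate_items(papers)
for item, pct_good, pct_poor, keep in report:
    print(f"Item {item}: good {pct_good:.0f}%, poor {pct_poor:.0f}%, keep: {keep}")
```

In this invented class Item 1 is passed by everyone and Item 5 favors the weaker pupils, so both are marked for rejection; the other three items favor the "good" group.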

We can illustrate this method by an example. Suppose 
we have carried out the above seven steps and obtain results 
like those of Table 1. 

Item 1 is neither too easy nor too difficult. In this respect 
it is satisfactory. But it does not differentiate between 
high abilities and low abilities. This item does not injure 
the test, but it might be replaced to advantage by an item 
showing greater differentiation. 
Compare Item 1 with Item 2. They show the same diffi- 
culty when both groups are considered. Item 2 is greatly 
superior to Item 1 because it discriminates between pupils 
of high levels of ability and those of low levels of ability. 
Items like No. 2 will make a more valid test than items like 
No. 1. 


TABLE 1 

Per Cents of “Good” and “Poor” Pupils Answering Individual 
Items of a Test 

              Per Cent of Correct Answers 
Item    “Good” Group    “Poor” Group    Both Groups 
  1          14              14              14 
  2          21               7              14 
  3           0               6               3 
  4          84              16              50 
  5          53              49              51 
  6         100              98              99 
  7           0               0               0 
  8         100             100             100 
  9           0               8               4 
 10          50              50              50 
Etc. 

Item 3 should be discarded. It is rather difficult, and 
backward pupils do better on it than really superior pupils. 
To throw this item out will increase the validity of the test. 
The same comments apply to Item 9. 

Item 4 is a good one. It discriminates sharply between 
good and poor pupils. It is well within the abilities of both 
groups. 

Item 5 has about the same average difficulty as Item 4, but 
it is greatly inferior because low-grade pupils do almost as 
well on it as high-grade pupils. Like Item 1 it does not hurt 
the test, but replacing it with one like Item 4 will raise the 
validity. 
Item 6 is very easy. A few such items may be kept in 
order to encourage the pupils. If so, they should form the 
first items of the test. Only a few such should be kept as 
they are too easy to help much in measuring, and “poor” 
pupils do almost as well as “good” on such tasks. 

Items like No. 7 should be thrown away except for a few 
to give “top” to the test, i. e., to prevent perfect scores. 

Item 8 is similar to Item 6, although even less difficult. 

Item 9 is too hard and also does not discriminate well 
between different levels of ability. It probably should be 
discarded. 

Item 10 suffers from the same fault as Item 5. It is by no 
means as valuable as Item 4, which is of about the same 
difficulty. It may be retained, although better items can 
be found. 
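
The item-by-item comments on Table 1 can be reproduced with a short calculation. The figures below are those of Table 1; the use of the simple difference, "good" minus "poor," as an index of discrimination is a convenience of this sketch and is only one way of expressing the comparison the text describes.

```python
# Table 1: item -> (per cent correct in "good" group, in "poor" group).
TABLE_1 = {
    1: (14, 14), 2: (21, 7), 3: (0, 6), 4: (84, 16), 5: (53, 49),
    6: (100, 98), 7: (0, 0), 8: (100, 100), 9: (0, 8), 10: (50, 50),
}

def summarize(good, poor):
    """Return (per cent correct for both groups, good-minus-poor difference)."""
    both = (good + poor) / 2   # valid because the two groups are equal halves
    return both, good - poor

# Rank items from most to least discriminating.
ranked = sorted(TABLE_1, key=lambda item: TABLE_1[item][0] - TABLE_1[item][1],
                reverse=True)
for item in ranked:
    both, diff = summarize(*TABLE_1[item])
    print(f"Item {item:2d}: both groups {both:5.1f}%, difference {diff:+d}")

# Items on which the "poor" group outdoes the "good" group.
reversed_items = [i for i, (g, p) in TABLE_1.items() if g < p]
print("Discard:", reversed_items)
```

Item 4 heads the ranking and Items 3 and 9 are the only ones with a negative difference, in agreement with the comments above.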

As has been stated, the method of experimental validation 
of individual items is usually employed only when standard 
tests are under construction. Certain informal classroom 
tests may justify the effort expended in making a study of 
the values of individual items. A defensible procedure has 
been given to cover such cases. In most cases it will not be 
necessary to resort to such elaborate methods. The judg- 
ments of one or more teachers will usually accomplish a 
degree of refinement commensurate with the needs of the 
average test. One more thing should be kept constantly in 
mind; viz., long tests may be expected to be more valid than 
short tests, and if a test is made long enough, it will usually 
yield a reasonably valid measure even if many individual 
items are faulty or worthless. 

The preceding pages have served as an introduction to the 
main concept of validity. We will return to this topic after 
the meaning of reliability has been considered, and again in 
Part II where we consider the actual plan for building valid 
and reliable tests. 
Summary of concept of validity. The following summa- 
rizing statements may serve to bring together and clinch the 
various ideas advanced as relating to validity: 

1. Validity is the degree to which an examination measures 
what it is claimed to measure. 

2. Validity includes reliability as well. 

3. Validity may be defined as the degree to which the test 
parallels the actual flow of instruction, and of the care 
exercised in choosing important materials, in excluding non- 
essentials, and in producing a correct balance of the materials 
used in the examination. 

4. In the usual school examination, validation rests largely 
on competent opinion. Standard tests, on the other hand, 
are validated, in part, by controlled experimental methods. 

5. Validation of tests and examinations will be facilitated 
by building a skeleton or “Table of Specifications” before 
actual work is begun on the test. (See Chapter VII.) 

6. Test items passed by all or failed by all have no validity. 

7. The validity of an examination is increased by arrang- 
ing the items in order of increasing difficulty. 

8. Validity of test items is reduced when the statements 
are ambiguous, faulty in sentence structure, or otherwise 
not clear. 

9. The greater the difference between the percentages of 
successes of strong and weak pupils on a test item, the more 
valid (discriminating) is the item. 

10. Long tests tend to be more valid than short ones. 

RELIABILITY 

Definition of reliability. Reliability is second only to 
validity as a criterion of the worth of a test or examination. 
We might say that the second most important fact which 
we can know about a test is the degree of reliability which it 
possesses. As in the case of validity, a number of statements 
are given below, which, collectively and individually, serve 
to define the concept. 

1. Reliability refers to the degree to which a test measures 
whatever it does measure; not necessarily what it is claimed 
to measure. 

2. Reliability refers to the degree of accuracy of measure- 
ment. 

3. Reliability refers to the amount of confidence that may 
be placed in the mark or score on a test as a measure of some 
ability of a pupil. 

4. Reliability is one aspect of validity. A valid test is 
necessarily reliable, but a reliable test need not have high 
validity, or for that matter have any validity at all for a 
particular purpose. 

5. Reliability refers to the stability of an estimate of a 
pupil’s ability from one sampling to another. For example, 
if a certain standard test is marketed with six equal or 
equivalent forms, this test is not reliable if the fluctuations 
of pupils’ scores from one form to the next are very large. 

6. If a valid and reliable test in one subject (say, geog- 
raphy) is given and the results are labeled as of another 
subject (say, manual training), the actual scores may remain 
reliable but be entirely invalidated by such misnaming. 
Thus a thermometer used to measure the velocity of the 
wind will retain whatever degree of accuracy (reliability) 
of measurement was built into the instrument by the manu- 
facturer, but the readings, however accurate, will be invalid 
measures of wind velocity. We see that in one sense validity 
is more specific than reliability; i. e., misuse of a test may 
injure its validity more than its reliability. 

Reliability is thus seen to be a more restricted term than 
validity; it is one aspect of validity. Validity implies re- 
liability, but the converse is not necessarily true. This 
point is often confusing, but later sections of this volume 
may help to clarify these relationships. 
Principal methods of insuring reliability to tests. There 
is little to be added to the discussion already given in the 
sections dealing with validity. Validity and reliability are 
so closely related, so far as actual test construction is con- 
cerned, that provision for the former in large measure takes 
care of both. The best approach to the technique of in- 
suring reliability in tests and examinations is a considera- 
tion of the two principal means of guaranteeing reliability, 
viz.: 

1. Objectivity of scoring or evaluating. 

2. Character of the sampling included in the test items. 

Objectivity as a means toward reliability. The objections 
to the traditional examination have very largely been 
centered about the variations which are bound to occur when 
equally competent teachers mark the same examination 
papers. Such variations obviously tell nothing about the 
merit of the paper. 

They constitute on the other hand a fertile source of un- 
reliability. We have termed such unreliability, unreliability 
due to subjectivity. In contrast to questions open to personal 
differences in the marking, there are many types of questions 
or items which do not permit even the slightest differences of 
opinion in deciding whether the answer is correct or incorrect. 
Such test items are termed objective. Examples of objective 
test methods are the familiar true-false, multiple-choice, 
matching tests, etc. Less perfectly objective are the com- 
pletion tests, which, with care, may be so phrased as to be as 
objective as practical requirements suggest. 

Example A shows a subjective type of question. Example 
B, in contrast, is an almost purely objective arrangement of 
the same test material. Example A was one of five questions 
in an examination in physiology, twenty points being allowed 
for a perfect answer. Example B provides twenty blanks, 
each one correctly completed to give one point of credit. 
Example A 

Trace the complete process of the digestion of food from the time it 
enters the mouth until the waste products are eliminated. 

Example B 

The mouth is concerned with digestion in two ways: first, the grinding 
action of the ______, and second, the chemical action of the enzyme 
______, which acts on ______ changing them into ______. In the stomach 
the most important enzyme is ______, which starts the digestion of the 
______. The gastric juice also contains an ______ which helps to kill the 
bacteria causing fermentation. The small intestine, which is a coiled tube 
about ______ feet in length, secretes a digestive juice itself as well as 
receiving the juices from the ______ and the ______. The bile aids chiefly 
in the digestion of ______ and in destroying acids from the stomach. The 
most important digestive juice in many ways is that of the ______, which 
contains three important enzymes which act on ______, ______, and 
______. The absorption of the digested foods is aided by the many 
finger-like projections in the ______ intestine known as ______. The 
waste materials collect in the ______ ______ and pass on out. If this 
waste material is not rapidly cleared out, ______ are formed which cause 
disease. 

Example B is not entirely objective, but study of the possi- 
ble insertions on each blank in turn will show that there are 
very few answers of merit which could be written in on any 
blank. If 100 teachers should grade a given pupil’s answers 
to this exercise it is to be doubted whether the greatest 
difference among these hundred teachers would be as much 
as five points. Moreover, if a set of scoring rules were drawn 
up and adhered to in the marking, the variation among 100 
teachers might be reduced to one or two points at most. In 
a long examination, the variation represented by one or two 
points in twenty is of slight moment. 

In Chapter III we shall see that questions similar to 
Example A often show disagreements as great as from three 
to twenty points when 100 teachers grade the same answer. 
We could eliminate all subjectivity in grading the question 
under discussion by making a true-false, multiple-choice, or 
simple recall test covering the same points. Example C 
below shows a twenty-item true-false test covering sub- 
stantially the same ground as Examples A and B. 

Example C 

 1. The principal function of the teeth in digestion is to 
    grind up the food.                                      True  False¹ 
 2. The salivary juice contains an enzyme called pepsin.    True  False 
 3. The enzyme of the saliva acts on starches.              True  False 
 4. Ptyalin changes sugars into starches.                   True  False 
 5. The gastric juice contains an enzyme known as ptyalin.  True  False 
 6. The gastric juice starts the digestion of protein.      True  False 
 7. The gastric juice contains hydrochloric acid.           True  False 
 8. The acid in the stomach helps to destroy bacteria 
    causing fermentation.                                   True  False 
 9. The small intestine is about six feet long.             True  False 
10. The small intestine receives three principal digestive 
    juices.                                                 True  False 
11. The bile digests mainly carbohydrates.                  True  False 
12. The most important digestive juice is that from the 
    liver.                                                  True  False 
13. Pancreatic juice contains three important enzymes, 
    acting, respectively, on starches, protein, and fats.   True  False 
14. The protein-digesting enzyme of the pancreas is 
    called amylopsin.                                       True  False 
15. Lipase acts on fats.                                    True  False 
16. The finger-like processes on the walls of the small 
    intestine are called cilia.                             True  False 
17. The villi are absorptive organs.                        True  False 
18. Undigested materials are stored in the large intestine 
    until needed.                                           True  False 
19. Important digestive enzymes are formed in the walls 
    of the large intestine.                                 True  False 
20. If the large intestine fails to act, toxins are formed 
    which may cause disease.                                True  False 

¹The practice of printing the words “true” and “false” at the right of each statement is 
not the best method of arranging true-false tests except, perhaps, with very young pupils. 
Later sections of this volume show other plans which permit much more rapid scoring. 
For example, the signs + and - or + and 0 may be used to indicate true and false state- 
ments respectively. 
Example C is perfectly objective and may be scored by a 
clerk entirely ignorant of physiology with quite accurate 
results provided a scoring key is furnished and reasonable 
care is exercised to avoid mistakes of carelessness. Example 
C may not prove to be more reliable than Example B, but 
both B and C are almost certain to be superior to Example A. 
Example B is not quite objective, but this limitation may 
be of less weight than is the unreliability introduced into the 
true-false test (Example C) by the opportunity for guessing. 
The exact merits of these three forms of what is substantially 
the same subject-matter need not be settled here. The point 
at issue is that of defining objectivity of scoring, and of 
showing its relation to reliability. 
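
What it means for a test to be scored "by a clerk" from a key can be shown in a short sketch. The five-item key and the pupil's answers below are invented; they are not a key to Example C, for which the text supplies none. The function name and the treatment of omitted answers (no credit) are likewise assumptions of this illustration.

```python
def score_true_false(answers, key):
    """Count answers that match the key; omitted answers (None) earn nothing."""
    if len(answers) != len(key):
        raise ValueError("answer sheet and key differ in length")
    return sum(1 for given, correct in zip(answers, key) if given == correct)

# Hypothetical five-item key and one pupil's answer sheet.
key = [True, False, True, False, False]
pupil = [True, False, False, None, False]   # item 3 wrong, item 4 omitted

score = score_true_false(pupil, key)
print(score)
```

Since the rule involves no judgment of merit, any two scorers applying the same key to the same paper must reach the same score, which is the sense in which such a test is objective.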

Chapter III will take up the experimental evidence on 
the question of unreliability arising from subjectivity in 
detail. 

Reliability as affected by sampling. The term sampling, 
and perhaps to some extent the idea of sampling, is ordinarily 
not highly conscious in the mind of the teacher in thinking 
about examinations and examination procedures. This is 
not altogether true, as teachers recognize that a two-, three-, 
or even a five-question examination of the traditional sort 
is not entirely adequate. They prefer that at least ten 
questions be asked in an examination of any considerable 
importance. If asked the reason why a two-question ex- 
amination is not as desirable as a ten-question test, the 
average teacher will answer, and rightly, “The former is 
too short.” This is equivalent to saying that a two-question 
examination is not very reliable. Reliability is thus a new 
term for an old, but not thoroughly analyzed, idea. This 
brings us to the conclusion that examinations should be 
relatively long. Why? Because a long examination is a 
more adequate sampling than a short one, all other factors 
being equal. 

A short examination suffers from many defects. Prom- 
inent among these are such facts as: (a) a short examination 
does not cover the ground thoroughly; (b) a short examina- 
tion places too much premium on the knowledge of the 
particular ground covered but tells nothing about other 
equally important divisions of the subject-matter; and (c) a 
short examination penalizes unduly a pupil who happens to 
have little knowledge of the particular questions but has 
otherwise a good knowledge of the subject, or, conversely, 
a short examination introduces an element of luck in that a 
pupil might know the particular questions asked but be 
shaky on others equally important but unasked. 

A test or examination is always a sample. Measurement 
is never complete; it always represents a sample of abilities. 
Other things being equal, the longer a test, the more adequate 
the sampling. The more adequate the sampling, the more 
reliable (and hence indirectly the more valid) the test. 

Since testing is sampling, any test score involves a certain 
amount of error of measurement. To say that a test is un- 
reliable is synonymous with stating that it has a large error 
of measurement. Reliability is accuracy of measurement. 
As a test is made longer and longer, it becomes more and 
more reliable, provided the test is increased by equally good 
test items. This idea is not new. Teachers are aware that 
the practice of basing term marks on a single examination is 
dangerous (unreliable). The average of two examination 
marks is safer. The average of many examination grades 
is still better. We can think of two examinations as one 
examination doubled. We can regard the average of ten 
examinations as one examination ten times as long as any 
one of the ten. 

If we accept the point of view that increasing the length 
of a test raises its reliability, and that this process may be 
continued without limit (in theory at least), we arrive at a 
point of view that: A test infinitely long is perfectly reliable. 
To increase the reliability of a test, ordinarily it is sufficient 
to lengthen it, i. e., extend the sampling. 
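
The relation between length and reliability is commonly expressed by the Spearman-Brown formula, which this section neither names nor derives; it is introduced here only as a sketch. The formula assumes that the added items are as good as the original ones, the same proviso made above. The starting reliability of 0.50 is an invented figure.

```python
def lengthened_reliability(r, n):
    """Spearman-Brown estimate: reliability of a test made n times as long,
    given the reliability r of the original test."""
    return n * r / (1 + (n - 1) * r)

r_short = 0.50  # hypothetical reliability of a short test
for n in (1, 2, 4, 10, 100):
    print(f"{n:3d} times as long: {lengthened_reliability(r_short, n):.3f}")
```

As the multiple grows, the estimate rises toward 1.00 without reaching it, which is one way of reading the statement that a test infinitely long is perfectly reliable.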

We can find an analogy to the unreliability of a very short 
test (narrow sample) in the hypothetical experience of a 
teacher. Suppose that it is now April or May and the 
teachers of your school system are about to receive notices 
of retention or dismissal for the following year. In order 
to form a basis for recommendations to the board of educa- 
tion your superintendent decides to base his judgment upon 
an unannounced five-minute visit to each teacher’s class- 
room. Upon the basis of his observation of a teacher’s 
classroom activities for five minutes, he decides to “drop” 
her. In another classroom he finds a very interesting recita- 
tion in progress (perhaps the only good one all week). He 
recommends that this teacher receive a $200 increase the 
next year. And so on. 

Would teachers feel the justice of this plan? A five-minute 
visit is fundamentally similar to a five-minute test; it is 
too short to afford a reliable basis for important judgments. 
But if the observational classroom visits were made fre- 
quently enough to sample the general run of the activities of 
a teacher, in the long run something could be told about the 
general worth of the teacher. Or, if the superintendent, 
principal, and other general supervisors confer and exchange 
notes, the combined judgment of several such persons will 
have more value than that of the superintendent alone. 

The position has been taken repeatedly in the preceding 
pages that measurement is always limited sampling, but it is 
only fair to state that at least one writer on educational 
measurement, Dr. C. W. Odell, challenges this view.¹ 

¹C. W. Odell, in reviewing the author’s Improvement of the Written Examination, in the 
Journal of Educational Research, December, 1925, p. 42, says: “. . . Ruch states that ‘measure- 
ment is always sampling,’ again ‘that an examination of ten questions, or worse still, five 
questions ... is a very small sampling is evident.’ Although both statements express general 
truths, the reviewer does not believe that they always hold. In a limited field one can 
secure a complete measure of ability. For example, a test may include all the addition 
combinations of simple digits. Likewise ten or even five questions may give more than a 
very small sampling if they are topical or in some other way call for large amounts of 
material.” 

One instance raised as a criticism of the view that measure- 
ment is always limited sampling is the possibility of complete 
measurement of a narrow ability or function like the 100 
addition facts. Let us consider this case, as it is a common 
school situation and an excellent illustration of the differ- 
ences of opinion, if any, between Odell and the author. 

Complete vs. incomplete measurement. There are but 
100 basic addition facts when we include the zero combina- 
tions and call both orders of addition of any two digits (e. g., 
6+4 and 4+6) different facts, which modern authorities 
agree upon as necessary. 

The author cares little about who is right or wrong so far 
as the controversy is concerned, but the illustration is an 
excellent one for bringing out certain concepts related to 
sampling and unreliability of measurement. The reader 
may choose between the two views at will, but the difference 
of opinion bears upon our present definition of unreliability. 

The following simple experiment was carried out as an 
illustration:¹ 

1. The one hundred basic addition facts were placed on 
one hundred small pieces of cardboard of uniform size. 

2. The cardboards were then shuffled thoroughly. 

3. They were then drawn, one at a time, and placed in 
the order of drawing as a test. This was called Form A. 

4. The shuffling and drawing process was repeated four 
times to yield Forms B, C, D, and E. 

5. The five forms (A to E) were mimeographed and 
administered to two groups of pupils on five successive days. 

Note: Class X was a beginning class which had not as yet thoroughly 
mastered the addition facts. Class Y was a strong third-grade class. 

6. The scores obtained were tabulated. This tabulation is 
given here, all incomplete sets being thrown out. 

¹By Miss Celia Gifford, Sunshine School, Berkeley, California. 

It should be noted that all five examinations contained 
exactly the same combinations (all the combinations), the 
order only being different from form to form. Each of the 
five forms was therefore a “complete sampling” in the sense 
used by Dr. Odell. 


Pupil    Form A    Form B    Form C    Form D    Form E 

Class X 
   1        99       100       100       100        99 
   2        86        98       100       100       100 
   3        89       100        99       100        99 
   4       100        98       100       100       100 
   5       100        95       100       100       100 
   6       100       100        99        98        96 
   7        87        98        99       100       100 
   8        77       100       100        99       100 
   9        78        32        94        57        76 
  10        90        98        93        80        99 
  11        18        90        39        38        47 
  12        79        79        90        91        91 

Class Y 
  13       100       100       100        97        99 
  14        94        98        93        94        93 
  15        97        99        97        98       100 
  16       100       100        92       100       100 
  17        98       100       100        98        98 
  18        93       100        98        97        98 
  19        99       100        99        99        99 
  20        87        93        90        87        90 
  21        99       100        99        99        99 
  22        99        96        93        96        92 
  23        99        99       100       100       100 
  24        98        99       100       100       100 
  25        88        78        81        76        89 
  26        99       100       100       100       100 
  27        89        90        94        92        86 
  28       100        98       100       100        98 
  29       100       100        98       100        95 
  30       100       100        99       100       100 


There are several instances of perfect agreement on two,
three, or four forms. There was not a single case where a pupil
earned exactly the same scores on all five forms. Pupils 26 and
30 made but one error each in 500 attempts. These single
errors are unquestionably the results of temporary “slips”
or confusion. Pupil 16 was consistent except on Form C
where he “fell down” noticeably. Pupils 9 and 11 are
unaccountably erratic. We can but wonder what the causes
were. Differences of five or more points between any two
forms are very common, e. g., Pupils 2, 3, 5, 7, 8, 9, 10, 11,
12, 14, 16, 18, 20, 22, 25, 27, and 29.
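
The observations in the preceding paragraph can be checked directly from the tabulated scores. The following Python sketch transcribes the table and flags every pupil whose highest and lowest scores differ by five points or more; the names `SCORES`, `score_range`, and `erratic_pupils` are hypothetical and chosen only for this illustration.

```python
# Scores on Forms A to E, transcribed from the table (pupil number -> five scores).
SCORES = {
    1: (99, 100, 100, 100, 99),   2: (86, 98, 100, 100, 100),
    3: (89, 100, 99, 100, 99),    4: (100, 98, 100, 100, 100),
    5: (100, 95, 100, 100, 100),  6: (100, 100, 99, 98, 96),
    7: (87, 98, 99, 100, 100),    8: (77, 100, 100, 99, 100),
    9: (78, 32, 94, 57, 76),      10: (90, 98, 93, 80, 99),
    11: (18, 90, 39, 38, 47),     12: (79, 79, 90, 91, 91),
    13: (100, 100, 100, 97, 99),  14: (94, 98, 93, 94, 93),
    15: (97, 99, 97, 98, 100),    16: (100, 100, 92, 100, 100),
    17: (98, 100, 100, 98, 98),   18: (93, 100, 98, 97, 98),
    19: (99, 100, 99, 99, 99),    20: (87, 93, 90, 87, 90),
    21: (99, 100, 99, 99, 99),    22: (99, 96, 93, 96, 92),
    23: (99, 99, 100, 100, 100),  24: (98, 99, 100, 100, 100),
    25: (88, 78, 81, 76, 89),     26: (99, 100, 100, 100, 100),
    27: (89, 90, 94, 92, 86),     28: (100, 98, 100, 100, 98),
    29: (100, 100, 98, 100, 95),  30: (100, 100, 99, 100, 100),
}


def score_range(scores):
    """Largest difference between any two forms for one pupil."""
    return max(scores) - min(scores)


def erratic_pupils(table, threshold=5):
    """Pupils whose scores on some pair of forms differ by at least `threshold`."""
    return [p for p, s in sorted(table.items()) if score_range(s) >= threshold]


flagged = erratic_pupils(SCORES)
identical = [p for p, s in SCORES.items() if len(set(s)) == 1]
print("Differences of five or more:", flagged)
print("Same score on all five forms:", identical)
```

The flagged list reproduces the seventeen pupils named in the text, and the second list is empty, which agrees with the statement that no pupil earned the same score on all five forms.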

It is to be expected that Class X would show larger fluc- 
tuations than would Class Y since the former were beginners 
in arithmetic and the latter had had more than a year of 
instruction. The magnitude of the disagreements, at times, 
comes as a surprise to anyone who has not actually carried 
out the equivalent of our procedure. 

We are now ready to generalize. The concept of sampling 
is dual in character. It includes: 

1. Limited sampling of the actual subject-matter. It is 
possible in the case of narrow functions to eliminate this 
source of unreliability (as in the case of the present illustra- 
tion). 

2. Limited sampling or unreliability arising from psychological
factors inherent in the mind of the pupil under examination.
Such psychological factors include carelessness,
undue haste, lapses of attention, state of health, fluctuations
in effort, variations in motivation, etc.

The more practical point of view would seem to be that
sampling is never complete until fluctuations in performance
(the pupils’ scores) are completely stabilized, even if the
subject-matter is completely covered. Stabilization was not
completed in our experiment even after five administrations of
the same (?) test, although for all practical purposes the
average score on the five forms would be a far more reliable
measure than we can ordinarily expect to obtain in school
work.

The criticism really reduces to a matter of definition.
Dr. Odell seems to prefer the narrower point of view; the
author has insisted on the dual view defined above.¹ The
main point of interest in the present discussion attaches to
the isolation and emphasizing of the fact that fluctuations
of performance arise from psychological factors, and that
these variations enter into the test scores and hence cause
unreliability of a single measurement.

It is to be doubted whether it is possible to construct five 
or ten questions which will sample even half of the field 
ordinarily covered by a major examination. To be sure, one 
might make up an examination in physiology somewhat as 
follows: 

I. Describe fully the cell basis of the human body. 

II. Discuss the complete process of digestion. 

III. Name the principal bones, tell how the skeleton is articulated, and 
describe its functions. 

IV. Give in detail the facts about respiration, the respiratory organs,
the relation of respiration to circulation and tissue-building, etc.

V. Etc. 

Since physiology texts are often organized around from 
ten to twenty main topics like cell structure, the skeleton, 
circulation, respiration, the nervous system, germ diseases, 
etc., it would be theoretically possible to “cover” the entire 
field in ten or twenty questions. But in what sense can it
be said that the subject has been “covered” (sampled
completely)? The main topics are all in, to be sure. But will
the pupil write everything that is important about all ten
or twenty? Will he write a half, a third, a fifth, or a tenth,
or less of what he actually knows about digestion? The
general nature of the answers to these questions must be
evident by now. The following pages report some hitherto
unpublished experimental evidence on this point.

¹After this manuscript had gone to press, Dr. Odell’s splendid treatment of the objective
examination was published under the title, Traditional Examinations and New-Type Tests.
From a discussion appearing therein (p. 42) it seems that Odell and the present writer are
now in substantial agreement on the issue of “complete” sampling.

Talbott’s study of the sampling afforded by essay
examinations.¹ E. O. Talbott, working under the direction of
the author, has carried out a study of the traditional
examination from the standpoint of sampling. His specific
problem is: what fraction of a pupil’s knowledge is elicited
when he is asked to “Discuss fully” (or some equivalent
phrasing) a given topic?

To approach a rough answer Talbott proceeded as follows: 

1. He gave an essay examination of from three to ten 
broad discussion questions. 

2. The pupils wrote their answers, unlimited time being 
given, although accurate records of actual working times 
were kept. 

3. The pupils took immediately a long objective test on 
each topic represented by an essay question in the first 
examination. Working time was again recorded. (The 
order of essay and objective tests was alternated from one 
experiment to the next.) 

4. The essay questions were scored for the number of 
ideas or facts written by the pupil. 

5. The objective tests were scored in such a way as to 
correct for chance or guessed successes. 

6. The ratio between essay scores and objective scores 
was computed for each pupil, and similarly for working time 
ratios. 

This is of course a very rough procedure, although one
which is far from valueless in forming general notions about
the merits of old- and new-type examinations as devices
for sampling completely the knowledge or skills possessed
by pupils. The errors of the method are likely to be quite
conservative since it is obvious that no objective test could
cover everything that a pupil might know. If the essay
papers are scored generously, as in Talbott’s study, the
ratios are likely to be larger, not smaller, than the truth.

¹Unpublished. To appear in the Journal of Educational Research.

Table 2 on page 54 presents a summary of the findings.
The first line of entries refers to an examination in eighth- 
grade geography. It will help to describe Talbott’s pro- 
cedure to discuss this geography examination briefly. 

The essay test consisted of the following questions: 

I. Discuss the continent of Europe. 

II. Discuss the green Northlands. 

III. Discuss fully Ireland, Scotland, and England. 

The objective test covering the same ground included 247 
items, 201 being true-false and 46 simple completion exer- 
cises. Samples of these items follow: 

1. Europe is next to the smallest inhabited continent. True False 

2. Europeans and those who have gone recently from 

Europe rule most of the world. True False 

3. Europeans have settled North and South America and 

Australia. True False 

4. The numerous lakes of Ireland and Scotland were 

probably caused by 

The completion tests were scored simply “number cor- 
rect,” but the true-false were scored “rights-minus-wrongs” 
in order to allow for chance or guessed successes. 
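
The two scoring rules just described can be sketched as follows. This is a minimal illustration, not Talbott's actual scoring routine: the function names and the sample key and responses are hypothetical, and omitted items (shown as `None`) are assumed to count neither as right nor as wrong.

```python
def rights_minus_wrongs(responses, key):
    """True-false scoring: number right minus number wrong; omissions are ignored."""
    rights = sum(1 for r, k in zip(responses, key) if r is not None and r == k)
    wrongs = sum(1 for r, k in zip(responses, key) if r is not None and r != k)
    return rights - wrongs


def number_correct(responses, key):
    """Completion scoring: simply the number of correct answers."""
    return sum(1 for r, k in zip(responses, key) if r == k)


# Hypothetical four-item true-false key and one pupil's answers (third item omitted).
tf_key = [True, False, True, True]
tf_answers = [True, True, None, True]
print(rights_minus_wrongs(tf_answers, tf_key))  # 2 right, 1 wrong -> 1

# A pupil who guesses blindly on two-choice items tends toward half right and
# half wrong, so the correction brings his expected score toward zero.
print(rights_minus_wrongs([True, True, False, False], [True, False, False, True]))  # 0
```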

Talbott’s method of scoring the essay questions may be
shown by a portion of an actual paper. He read each paper
carefully, drawing vertical lines at the end of each separate
thought or fact to the best of his judgment, attempting to
be generous and giving no penalties for errors of spelling,
grammar, etc. No credit was given for incorrect statements.
Thus:

Europe is just east of U. S. | There are many manaured (manufactured?)
things sent out | and lots of imports. | Some of this country of Europe is
ruled by a King | and others just about the same as U. S. | Etc.

Five credits were given for this portion, although the first 
statement is somewhat questionable. 

We are now ready to examine Talbott’s findings.

The row of averages at the bottom of Table 2 is best for
purposes of drawing probable conclusions. It seems likely
that the essay test calls forth less than half (.44) of the pupil’s
real knowledge of the subject. To elicit this two-fifths or
half of a pupil’s knowledge required two times as much
working time (2.01) as was needed for the long objective
test. It thus appears that the objective tests used were
four or five times as efficient as the essay questions as devices
for sampling when we take into account both average working
times and sampling ratios. There can be little doubt
that the objective test is the more efficient sampling method
per unit of working time.


TABLE 2

Selected Results from Talbott’s Study of the Adequacy of the
Traditional Examination as a Device for Sampling

(Sampling ratio: essay score divided by objective score. Time ratio:
essay working time divided by objective working time. Entries shown
as ... are missing from this copy.)

                          Sampling Ratios               Time Ratios
Subject                Highest  Lowest  Average   Highest  Lowest  Average     N

Geography (Grade 8)      .98     .28     .66        2.76     .61     1.31      17
History (Grade 8)        .90     .21     .44        7.30    1.13      ...      17
U. S. History I          .84     .37     .51        1.91     .93     1.53      14
U. S. History II         .69     .21     .36        2.40     .91     1.85      20
Chemistry I              .81     .24     .41        2.76     .76     1.90      15
Chemistry II             .64     .22     .37        3.20     .94     1.96      20
Citizenship I            .92     .25     .41        5.18     ...     2.03      18
Citizenship II           .68     .24     .38        2.36     ...     1.73      19
General Science I        .72     .22     .47        4.07     ...     3.05      13
General Science II       .56     .23     .40        2.16     ...      ...      14

Averages of Columns      .77     .25     .44        3.41     .91     2.01   (Total) 167


Table 2 shows that in certain individual cases (“Highest”
column), pupils did write from two-thirds to practically
all that they knew under the stimulus “discuss fully.” At
the same time the “Lowest” ratios fell as low as one-fifth
(.21) to less than two-fifths (.37). In the latter cases the
directions to “tell all you know” guaranteed very little.

Talbott’s results must not be taken as final, especially in
view of the obvious crudities of the method. Thus far he 
has studied twenty elementary and high-school classes in 
two school systems with consistent results. More work is 
needed on this and similar problems before reaching final 
decisions. It is somewhat surprising that so little critical 
work has been done on so many of the claims of the tradi- 
tional examination. 
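
The “four or five times as efficient” figure drawn from the averages row of Table 2 follows from simple arithmetic, sketched below in Python. The variable names are hypothetical, and the definition of relative efficiency used here (knowledge elicited per unit of working time, objective test relative to essay test) is an interpretation of the text rather than a formula given by Talbott.

```python
# Average sampling ratios (essay score / objective score) for the ten classes of Table 2.
SAMPLING_AVERAGES = [0.66, 0.44, 0.51, 0.36, 0.41, 0.37, 0.41, 0.38, 0.47, 0.40]
CLASS_SIZES = [17, 17, 14, 20, 15, 20, 18, 19, 13, 14]
MEAN_TIME_RATIO = 2.01  # printed average of essay time / objective time

# Unweighted mean of the class averages, as in the "Averages of Columns" row.
mean_sampling = sum(SAMPLING_AVERAGES) / len(SAMPLING_AVERAGES)

# The essay elicits `mean_sampling` of what the objective test elicits while
# taking MEAN_TIME_RATIO times as long, so per unit of time the objective
# test yields MEAN_TIME_RATIO / mean_sampling times as much.
relative_efficiency = MEAN_TIME_RATIO / mean_sampling

print(round(mean_sampling, 2))        # 0.44
print(sum(CLASS_SIZES))               # 167
print(round(relative_efficiency, 1))  # 4.6
```

The time ratios themselves cannot be re-averaged here because several entries are missing from this copy of the table; only the printed average of 2.01 is used.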

Unreliability usually due to both subjectivity and limited
sampling. Unreliability has been treated as falling into two
categories: (a) unreliability due to subjectivity and (b)
unreliability due to limited sampling. These two sorts of
unreliability are hard to differentiate in many kinds of
examinations. They usually are found to operate together in the
traditional examination. In standard tests and objective
classroom tests, subjectivity may disappear completely, but
limited sampling will remain as a disturbing factor. We
may consider four sorts of tests by way of illustration:

Test A: 10 discussion (essay) questions on United States 
history. 

Test B: 1000 discussion questions on human physiology. 

Test C: 10 true-false questions on grammar. 

Test D: 250 multiple-choice questions on geography. 

Test A is a very common type. It is limited in scope since 
but ten questions are employed. These are discussional in 
character and hence open to differences of opinion in scoring. 
Such a test is ordinarily but moderately reliable. It is
subject to both sources of unreliability, limited sampling and 
subjectivity of marking. 

Test B is hardly a practical illustration. Such a test 
would take many days. It would be comprehensive, and 
for all practical purposes limited sampling would not enter. 
Subjectivity of scoring would remain.

Test C is objective. If 1000 teachers should mark the
same pupil’s paper, the agreement would be perfect except 
for occasional errors of carelessness. But it would be a very 
inadequate sample of a pupil’s achievement in grammar. 

Test D is both highly objective (probably perfectly so) 
and a reasonably wide sample. It might be the best test in 
the lot. Test B, in theory, might be better, but it would be 
impossible to administer such a test in practice. If well
made, Test D is a reasonable approximation to the elimination
of both sources of unreliability. Test D might be given
in sixty to ninety minutes, a justifiable expenditure of time. 

Objectivity is attainable by a change, where possible, to
the new-type examination. Subjectivity cannot be eliminated
from the essay-type test by any method within the
limits of practicability. Two solutions have been proposed 
for remedying the unreliability of the traditional examination 
without its abandonment: (1) having a number of teachers 
grade the same papers and taking the average as the mark; 
(2) using a written set of rules for scoring all doubtful
situations. The first mentioned remedy is undeniably efficacious.
The difficulty is that it might take dozens of teachers to 
reduce the personal judgments to a stable basis or average. 
This method will not work out in practice, as is evident. 
The second method is somewhat more promising. It will 
be considered further in Chapter III. The experimental 
evidence to date makes it questionable whether the refine- 
ment possible in this direction will be sufficient to eliminate 
the main force of the objection to subjective examinations. 

A sampling theory of examinations.¹ “Examination
practices at the present time make use of two more or less
distinct theories of sampling. The first of these may be
called the intensive sampling and is represented by the
traditional essay type of examination. The second, or
extensive sampling, is characteristic of the more recent
new-type or objective examinations. The former examination
usually consists of five or ten questions which are to be
answered exhaustively. The latter is more likely to comprise
from 50 to 250 or more narrow questions which sample
widely but not intensively.

¹Quoted from G. M. Ruch et al., Objective Examination Methods in the Social Studies
(Chicago: Scott, Foresman and Co., 1926), pp. 12-14.

“It is not always borne in mind that any examination is 
at best a limited sampling of the total field which might be 
covered by the examination. Testing is therefore invariably 
partial, never complete. Unreliability due to sampling can 
be reduced by increasing the number of questions asked. 
Theoretically, an examination is perfectly reliable only when 
the sampling is infinitely long. 

“The two theories of sampling as applied to examinations 
may be illustrated by the following scheme: 

“Let A, B, C, D, . . . N below represent the topics in a 
particular school subject. Each should be suitable for one 
broad question of the usual type. Let 1, 2, 3, . . . n, 
represent single facts or items of information, etc., falling 
under each of the topics denoted by capital letters. Thus: 


A    B    C    D    . . .    N
1    1    1    1    . . .    1
2    2    2    2    . . .    2
3    3    3    3    . . .    3
.    .    .    .             .
n    n    n    n    . . .    n

[The same facts are shown in diagrammatic form in Figs.
2 and 3.]

“It is logical to suppose (and it can be shown
experimentally¹) that knowledge of item or question A1 is much
more likely to guarantee knowledge of items A2 and A3
than it is to guarantee knowledge of items B1, C6, and Nn.
This is merely equivalent to saying that the intercorrelations
are higher among items of the same column than between
items drawn from different columns.

¹For example, in standard tests, like the Thorndike-McCall Reading Test, it can be
shown that the items based upon the same reading paragraph are more highly interrelated
than are items of two different reading paragraphs. Here the paragraphs are analogous
to the capital letters in the scheme, and the questions or items based on the paragraphs are
analogous to the 1, 2, 3, etc., falling under each capital letter.





Fig. 2. Diagrammatic representation of the “extensive” sample. The
dotted portions represent the items or facts actually covered by the
examination. The numbers and letters are those used in the preceding text.

“The traditional or intensive type of examination tends to 
the position of including a few (five to ten) columns or topics, 
these being answered in great detail. The newer objective 
examination tends to the extensive sampling, i. e., a few 
narrow items drawn from many columns. 

“A priori, the advantage lies with the extensive sampling
so far as reliability of sampling is concerned, since such
samples are not so greatly affected by occasional faulty
questions, the missing of work due to absence from school,
and other obvious factors. It might also be pointed out
that the situation with respect to subjectivity of scoring is
similar, since the narrow question is less subject to personal
opinion than the broader type of question.”

Fig. 3. Diagrammatic representation of the “intensive” sample. The
dotted portions represent the items or facts actually covered by the
examination. The numbers and letters are those used in the preceding text.
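
The scheme of topics (capital letters) and items (numbers) in the quoted passage can be mimicked with a small grid. The Python sketch below is an illustration only: the grid size of ten topics with twenty items each, and the function names, are hypothetical. It draws an intensive and an extensive sample of the same length and shows how differently they spread over the topics.

```python
from itertools import product


def build_grid(n_topics, n_items):
    """All (topic, item) cells; topics are lettered A, B, C, ..."""
    topics = [chr(ord("A") + i) for i in range(n_topics)]
    return list(product(topics, range(1, n_items + 1)))


def intensive_sample(grid, chosen_topics):
    """Every item under a few chosen topics (the essay-type plan)."""
    return [(t, i) for t, i in grid if t in chosen_topics]


def extensive_sample(grid, items_per_topic):
    """A few items under every topic (the objective-test plan)."""
    return [(t, i) for t, i in grid if i <= items_per_topic]


def topics_covered(sample):
    return sorted({t for t, _ in sample})


grid = build_grid(n_topics=10, n_items=20)   # hypothetical subject: 200 facts
intensive = intensive_sample(grid, {"A", "B"})
extensive = extensive_sample(grid, 4)

print(len(intensive), topics_covered(intensive))  # 40 facts from 2 topics
print(len(extensive), topics_covered(extensive))  # 40 facts from all 10 topics
```

With equal numbers of facts sampled, a pupil who missed topic A through absence loses half of the intensive sample but only a tenth of the extensive one, which is the point made in the quotation.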

The interrelations of validity and reliability. The attempt 
has been made to approach the concepts of validity and 
reliability from a number of different angles. Much of the 
discussion has been somewhat theoretical. This was done 
intentionally upon the theory that a full understanding of 
the principles of educational measurement requires the 
acquisition of a new vocabulary and the analysis of measurement
into its principal underlying concepts. The relation
between validity and reliability may now be pushed some- 
what further in theory. 

Figure 4 shows two vertical scales which represent validity
and reliability, respectively. Lines are drawn to show the
theoretical interrelationships. Solid lines represent possible
relationships. Dotted lines represent impossible conditions.

[Fig. 4. Scale of Validity and Scale of Reliability.]

Fig. 4 states certain theoretical relationships between
validity and reliability. The scales of validity and reliability
are graduated in terms of (correlation) coefficients of validity
and reliability.

The following statements are based upon Fig. 4. 

1. Line a represents a possible situation: a test perfectly
valid and also perfectly reliable; in fact, such a test could not
be perfectly valid unless also entirely reliable. It need
hardly be pointed out that such tests exist only in theory,
since all educational measures are both invalid and
unreliable in greater or less degree. Line a therefore represents
a limiting condition.

2. Line b is also possible, and is occasionally approximated 
in actual practice. The test is totally unreliable and hence 
has no validity. 

3. Lines c, d, e, f, and g have been drawn to show that a
test may be highly (in theory, perfectly) reliable and yet
have no validity. Such extreme cases do not occur often in
actual practice, but such a condition might be found when
reliable tests are grossly misapplied.

4. Lines i′, i″, and i‴ are inserted to show that it is
possible for the validity of a test to be somewhat greater
than its reliability, if these are expressed in certain quantitative
terms. Such conditions need not concern us here.¹

The preceding discussion is probably sufficient for present
purposes. The issues raised will gradually become more
meaningful as later chapters present concrete experimental
studies related to reliability. The treatment of reliability
may be closed by a brief summary of the points of view
presented. Chapter XV treats of the statistical determination
of reliability by means of coefficients of correlation.

¹This statement means that the validity of a test may be as high as the square root of
the reliability of the test. In this case validity represents correlation against a perfectly
valid criterion. See T. L. Kelley, Statistical Method (N. Y.: The Macmillan Co., 1923),
pp. 205-208. Kelley gives a formula which is at times of the greatest value in discovering
whether it is worth while to attempt to improve the validity of a test through increasing
its reliability.

The serious student of educational measurement may perhaps be curious as to the reason
why lines i′, i″, and i‴ were inserted, in which case see the above reference.
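
The ceiling stated in the footnote to statement 4, namely that validity against a perfectly valid criterion may be as high as the square root of the reliability, is easy to tabulate. The Python sketch below assumes nothing beyond that square-root relation; the function name `validity_ceiling` and the sample reliabilities are hypothetical.

```python
import math


def validity_ceiling(reliability):
    """Upper limit of validity for a given reliability coefficient (0 to 1)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return math.sqrt(reliability)


for r in (0.49, 0.64, 0.81, 1.00):
    print(f"reliability {r:.2f} -> validity at most {validity_ceiling(r):.2f}")
```

Since the square root of a fraction exceeds the fraction itself, the ceiling always lies at or above the reliability coefficient, which is why validity can be “somewhat greater” than reliability when both are expressed as correlation coefficients.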

Summary of concept of reliability. The following
summarizing statements will review the principal ideas centering
about reliability:

1. Reliability is second only to validity in constructing 
educational measurements. 

2. Reliability is that phase of validity which refers to 
the accuracy of a test as a measuring device. 

3. Reliability is presupposed when validity is established. 

4. Reliability is the stability of numerical scores for the
same individual or individuals when equally difficult and
similar examinations are applied in sequence.

5. Reliability may not be reduced as greatly as is validity
when a test is grossly misapplied.

6. Reliability is guaranteed in two principal ways: (a)
objectivity of scoring, and (b) extended sampling (length
of test).

7. Traditional examinations are seldom highly reliable
because personal opinion enters largely into the evaluating
of the papers.

8. Unreliability due to subjectivity can be nearly or
entirely eliminated by the use of new-type questions like
true-false, multiple-choice, completion, matching tests, etc.

9. Unreliability due to limited sampling is never entirely 
eliminated. 

10. Measurement is never complete; it always represents 
a sampling of abilities. (Even in the case of narrow functions, 
where all of the subject can be included in the test, measurement
still is sampling, on account of the fluctuations in
psychological factors involved in answering the test.) 

11. Any test score involves greater or less error.

12. As long as repetitions of the same test (or equivalent
tests) show fluctuations in the scores of a pupil, measurement
cannot be said to be complete, i. e., stabilized.

13. Traditional examinations and new-type objective
tests differ in their underlying theories of sampling; the
former have been termed intensive samples and the latter
extensive samples.

14. Other things being equal, extensive sampling tends
to be the more reliable.

15. Theoretically, a test must be infinitely long in order 
to be perfectly reliable. 
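
Statement 15 can be illustrated numerically with the Spearman-Brown formula. That formula is not stated in this section; it is brought in here as an assumption about how reliability grows when a test is lengthened with items comparable to those already in it. The function name `spearman_brown` and the starting reliability of .50 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is made `length_factor` times as long."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


start = 0.50  # hypothetical reliability of the original test
for n in (1, 2, 5, 10, 100):
    print(f"{n:>3} times as long -> reliability {spearman_brown(start, n):.3f}")
```

Under this assumption the predicted coefficient rises toward 1.00 as the test is lengthened but reaches it only in the limit, which is the sense in which a perfectly reliable test must be infinitely long.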

EASE OF ADMINISTRATION AND SCORING 

Need for adequate instructions. Objective tests are less 
familiar to many pupils than are the traditional types of 
examinations. The test exercises present much more com- 
plicated features than the usual essay-type question. Some 
experience and thought are necessary for the concise and 
skillful statement of test instructions so that the dullest 
pupil cannot fail to know what he is expected to do. The 
teacher must remember that an educational test is designed 
to measure achievement, not the understanding of instructions. 
It is true that tests calling for the following of directions are 
sometimes employed in intelligence testing, but such exer- 
cises are exactly what must be avoided in educational 
measurements if valid and reliable results are to be obtained. 
It is never entirely wise to eliminate directions to the pupils 
from tests even if the same types of tests have been employed 
repeatedly. Pupils forget certain cautions and directions 
between times even if they think they remember perfectly. 
The few lines of space required for a statement of instruc- 
tions for a test can hardly be regarded as wasted space. 

A few suggestions are given here for the guidance of
teachers inexperienced in constructing objective tests.

1. The instructions should tell the pupil in simple
language what he is to do. They should cover such points as
(a) what mark he should use to designate his answer (+, −,
✓, ×, underlining, etc.), where to place the answer, what to
do to correct an answer, etc.

2. Avoid difficult terms like “encircle,” “underline,” 
“underscore,” etc., in favor of “draw a circle (or ring) 
around,” “draw a line under,” etc. Words like “encircle” 
may be used with high-school pupils, perhaps, but certainly 
not with grammar-school pupils. Even in the case of older 
children there is no reason to risk the use of a term which 
might not be understood. 

3. Give two or three samples, or even more when a new 
type of test is employed. It is a good plan to have one or 
two marked correctly, and then to mark the rest of the 
samples in unison as the teacher directs. 

4. If a new test technique for which it is somewhat difficult 
to phrase concise directions (e. g., matching tests of many 
sorts) is used, give a preliminary test or fore-exercise to 
familiarize the pupils with the mechanics of the test. If 
this is not done, discount the results from the first adminis- 
tration of such a test. 

5. When objective tests are given the first few times, the 
teacher will do well to circulate about the room watching for 
pupils who have failed to understand the directions. These 
should be given enough individual help to get them started 
to work properly. 

6. Instruct the pupils what to do in case they are in doubt 
about a particular item, i. e., whether to leave it out or to 
guess at the answer. (See the evidence on this point in 
Chapter XII.) 

7. Inform the pupils whether they are to work as rapidly
as possible or to take time to be sure that each answer is
correct before proceeding to the next item. (Chapter VII
presents a number of adequate sets of directions to pupils.)

Need for economical scoring. The objective test is
admittedly time-consuming in its construction. Some of this
time can be saved in the scoring. The traditional test is 
devised rather quickly, but it requires much time for scoring. 
In view of the extra demands upon the teacher’s time, the 
new-type test should be arranged mechanically, before 
mimeographing, so that it can be scored economically by 
means of answer keys or stencils. It is possible to use 
inexperienced clerical help or even older pupils for scoring
objective tests if care has been taken to make the scoring 
simple. 

In Chapter VII a number of devices will be presented 
which will be useful in facilitating scoring when large num- 
bers of tests must be handled. For the present a few sug- 
gestions must suffice. 

1. Checking, underlining, encircling, etc., are more 
economical of scoring time than are words written in. (In 
completion tests it is usually necessary to have the pupil 
write the words which have been left out.) 

2. It is best to arrange the test so that the responses fall 
in vertical columns down the page. One column is better 
than two or more. When the responses form vertical 
columns, strips of cardboard bearing the correct answers may
be placed alongside (preferably at the left) of the pupils’ 
answers. This makes a rapidly scored test. Matching tests 
lend themselves nicely to this arrangement. Simple recall 
tests may be made to do so by aligning the terminal blanks 
in a column. Multiple-choice tests are not so convenient, 
as the responses will occur in irregular positions on the page. 
The same is true of completion tests. 

3. Multiple-response (multiple-choice) tests can be scored
more rapidly if the method of responding is by number
instead of underlining. This method must not be used with
young pupils as there are great dangers of confusion, and
directions are difficult of phrasing.

NORMS OR STANDARDS FOR EVALUATING TEST SCORES

Norms not essential. Norms have certain values in con- 
nection with the interpretation of standard tests. When, 
however, the teacher constructs her own tests, there are no 
available norms or standards of attainment. Norms would 
add certain facts to the interpretation of objective test 
results, but the expense of deriving such standards would 
not justify the attempt. 

As a matter of fact, the value of norms has been badly 
over-estimated, even in the case of standard tests.¹ Carefully
derived norms have unquestionable value, but local
conditions relative to the course of study, ages of children 
in the different grades, differences in racial and economic 
background, variations in mental ability, etc., make general 
or “blanket” norms uncertain business. 

The constructor of objective tests must seek other means
of interpretation than through the use of norms. Local
norms may be derived with the accumulation of records,
and in the long run, interpretations may be made quite as
accurate as practical demands suggest.

¹The reader interested in the justification of this statement to the effect that norms are
“over-rated” can find the argument stated in Ruch and Stoddard, Tests and Measurements
in High School Instruction, pp. 60-62.

DUPLICATE OR EQUIVALENT TEST FORMS

Definition of equivalent forms. The term equivalent
forms has been borrowed from the nomenclature of standard
tests. There are several synonyms which are quite common,
viz., comparable forms, similar forms, duplicate forms, and
equal forms; the last-mentioned being a looser expression.
When standard tests are prepared, two or more equivalent
forms are usually constructed. In general, the larger the
number of equivalent forms, the greater the usefulness of
the test. The conditions which must be met in making
several comparable forms of a test are rather rigid if the
maximum value is to result from such duplication. We
can mention the following conditions as, collectively, defining
the term:

1. Equivalent forms must show equal average scores 
when applied to large numbers of children. 

2. The spread of scores (range or variability) should be 
the same on all forms. 

3. There should be no duplication of items from one form 
to another; i. e., each form should be an independent 
sampling. However, 

4. All forms must sample exactly the same function or
ability.

5. The correlation (degree of correspondence) should be as 
high as possible between forms. (This is the same as saying 
that each form should be as reliable as possible.) 

If all five conditions were met perfectly, and a large num- 
ber of equivalent forms were given to the same pupils, each 
pupil would receive exactly the same score on each form, 
thus: 


Pupil   Form 1   Form 2   Form 3   Form 4   Form 5   Form 6  . . .  Form N

           68       68       68       68       68       68   . . .     68
          106      106      106      106      106      106   . . .    106
           17       17       17       17       17       17   . . .     17






Such exactness never occurs as a matter of fact, for many 
reasons, chiefly unreliability and the gains through such a 
series due to practice effects. 
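
Conditions 1, 2, and 5 above lend themselves to a simple numerical check. The following is a minimal sketch in Python, not a procedure given in this volume. The pupil scores and the names `compare_forms`, `form_1`, and `form_2` are hypothetical, chosen only to illustrate the computation.

```python
from math import sqrt


def mean(xs):
    """Arithmetic mean of a list of scores."""
    return sum(xs) / len(xs)


def std_dev(xs):
    """Standard deviation about the mean (dividing by N)."""
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson_r(xs, ys):
    """Product-moment correlation between paired scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std_dev(xs) * std_dev(ys))


def compare_forms(form_a, form_b):
    """Summarize how nearly two forms meet conditions 1, 2, and 5."""
    return {
        "mean_difference": mean(form_a) - mean(form_b),   # condition 1
        "sd_ratio": std_dev(form_a) / std_dev(form_b),    # condition 2
        "correlation": pearson_r(form_a, form_b),         # condition 5
    }


# Hypothetical scores of the same eight pupils on two forms (illustrative only).
form_1 = [68, 106, 17, 55, 80, 92, 40, 73]
form_2 = [70, 101, 22, 58, 77, 95, 38, 71]

summary = compare_forms(form_1, form_2)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A mean difference near zero, a ratio of standard deviations near 1, and a correlation near 1.00 would suggest a fair approximation to equivalence for the group tested. Conditions 3 and 4 concern the content of the forms and cannot be judged from scores alone.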

Values of duplicate forms in objective tests. It must be 
apparent that a close approximation to the conditions which 
have been laid down for equivalent or similar forms could 
only be obtained by elaborate experimentation. As has 
been said, the more carefully constructed standard tests are 
published in from two to five or more forms. 

Roughly equivalent forms would serve a number of uses 
in the case of informal objective tests, were such available 
without great expenditures of time and energy. Let us 
consider some of these uses. 

1. Every teacher is burdened with make-up examinations 
for absent or failing pupils. Conditions usually preclude the 
repetition of the regular examination. An equivalent form 
of the examination would be valuable. If the teacher makes 
up a new test, she can hardly avoid the inequalities of 
difficulty, variability, etc., which we discussed at some 
length previously. 

2. There are always a certain number of doubtful cases 
in the results from any examination. Pupils occasionally 
perform very badly on a particular examination in compari- 
son with their general records. Such doubtful cases should 
be re-tested. However, there is often little gain in giving a 
second test not known to be comparable to the first.1 

3. Duplicate forms may be distributed in rotation to 
pupils taking examinations and thus prevent cheating since 
no two adjacent pupils receive the same questions. 

4. A very common method of teaching is to lay out the 
work for the class in units (“projects,” “contracts,” etc.). 
Such an organization is an aid to individualized instruction. 
A pupil works on one unit until, in his judgment, he is ready 
to go on to the next. The teacher sometimes tests the pupils 
individually on the unit supposedly mastered. At times the 
pupil must return to that unit for further work before he is 

1It is worth while to digress somewhat at this point and call attention to a phenomenon 
of test scores, which unfortunately is too little appreciated. It has been proved that very 
high test scores are more likely to be in error upward (i. e., too high), and that very low 
scores are more likely to be lower than the truth; i. e., if a number of strictly equivalent 
forms are given and the results averaged, the average scores tend to move toward the 
average of the group. The expression is that test scores, due to unreliability, regress on the mean 
(average). This fact was discovered by Sir Francis Galton during a study of the inheritance 
of stature. Regression is a statistical term for the fact that extreme scores are less reliable 
than those falling nearer the middle of the class. 

It is always well to re-test very high and (especially) very low cases. 

permitted to go on. It is a question here of duplicate forms 
or the less satisfactory plan of repeating the same test one 
or more times. There is more incentive to the pupil to be 
allowed a “fresh” examination. 

5. The last advantage of duplicate forms is perhaps the 
most important, as it gives a certain aid in a persistent and 
perplexing problem in grading school work. The typical 
practice is to change the examinations from semester to 
semester or at least to repeat them only at intervals of 
several years. This practice suffers from a serious limitation: 
it effectually blocks the accumulation of records which would 
generalize and refine our bases for assigning school marks. 

Full discussion of this point is reserved for Chapter XIV 
which deals in some detail with marks and marking systems. 
For the present it is sufficient to point out that the usual 
practice of building examinations anew, year after year, can 
result only in large differences in the difficulty of such ex- 
aminations from time to time. Such differences make it 
almost impossible to tell whether successive classes differ 
in average ability or whether the apparent differences are 
nothing more than variations in the difficulty of the 
examinations. Well-constructed duplicate examinations are 
sufficiently equal in difficulty to enable the teacher to com- 
pare successive classes rather fairly, and, further, to pool 
the test scores of such classes, year by year, arriving finally 
at a reasonably accurate local norm. The use of duplicate 
forms lends itself to a generalization of experience and an 
accumulation of records which will greatly refine grading 
practices. By assuming the duplicate forms to be equal, for 
practical purposes, all pupils over a period of years are 
graded on the same scale of marking. 
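
The regression of extreme scores toward the mean, mentioned in the footnote to use 2 above, can be illustrated by a small simulation. This is only a sketch under stated assumptions: each pupil is given a hypothetical "true" score, and each of two equivalent forms adds an independent chance error. The function name `simulate_regression` and all numerical settings are illustrative and do not come from any study cited in this chapter.

```python
import random


def group_mean(indices, scores):
    """Mean score of the pupils whose positions are listed in indices."""
    return sum(scores[i] for i in indices) / len(indices)


def simulate_regression(n_pupils=2000, true_sd=15.0, error_sd=10.0, seed=1):
    """Compare the highest and lowest tenths on Form 1 with their Form 2 means.

    Assumed model: observed score = true score + independent chance error.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(100.0, true_sd) for _ in range(n_pupils)]
    form_1 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    form_2 = [t + rng.gauss(0.0, error_sd) for t in true_scores]

    # Rank pupils on Form 1 only, then look at the same pupils on Form 2.
    ranked = sorted(range(n_pupils), key=lambda i: form_1[i], reverse=True)
    tenth = n_pupils // 10
    top, bottom = ranked[:tenth], ranked[-tenth:]

    return {
        "top": (group_mean(top, form_1), group_mean(top, form_2)),
        "bottom": (group_mean(bottom, form_1), group_mean(bottom, form_2)),
    }


result = simulate_regression()
print("Top tenth:    Form 1 = %.1f, Form 2 = %.1f" % result["top"])
print("Bottom tenth: Form 1 = %.1f, Form 2 = %.1f" % result["bottom"])
```

Under these assumptions the pupils highest on Form 1 average lower on Form 2, and those lowest on Form 1 average higher, which is the reason for re-testing extreme cases on an equivalent form.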



CHAPTER III 


OBJECTIONS TO THE 
TRADITIONAL EXAMINATION 

The objections stated. We have already seen that the 
principal objections to the essay-type examination reduce to 
the question of subjectivity of scoring. There are minor 
objections which have been advanced from time to time, 
some of these being closely related to the matter of unre- 
liability due to lack of objectivity; but others raise questions 
of economy, sampling, etc. The commonest objections are: 

1. Subjectivity of scoring lowers the reliability. 

2. The sampling must be limited to a small number of 
broad questions. 

3. The time required to write lengthy answers is excessive. 

4. These examinations encourage bluffing. 

Before undertaking to comment on these reputed limita- 
tions in detail, it will be best to survey a few selected studies 
on the general problem of the reliability of school marks and 
marking systems. 

INVESTIGATIONS OF TEACHERS' MARKS 

Johnson’s investigation. Table 3 is quoted from Kelly’s 
arrangement of Johnson’s study of marks in the University 
of Chicago High School.1 That the standards of the various 
departments listed in Table 3 show wide variations needs no 
comment. English teachers fail almost three times as many 
pupils as do domestic science teachers and give but half as 
many A’s. A pupil’s chance of getting an A in German is 
approximately twice as great as his getting one in French. 

1F. J. Kelly, “Teachers’ Marks,” Teachers College Contributions to Education, No. 66 
(New York: Columbia University, 1914), p. 11. 

TABLE 3 


The Distributions of the Marks of the Several Departments of 
THE University of Chicago High School 
(From Johnson) 


                   Total No.
Department          of Marks   % of [F]     % of D     % of C   % of [B]     % of A

Greek and Latin          868       10.6       16.1       31.8       23.5       17.9
German                   416        8.4       19.5       26.4       28.6       17.1
French                   475       10.9       18.7       33.0       28.0        9.3
English                 1514       15.5       21.7       32.8       23.4        6.5
Mathematics             1466       14.5       25.2       27.6       21.1       11.5
History                  825        8.1       15.9       31.2       30.0       14.7
Science                  672        8.3       16.8       27.7       32.6       14.6
Domestic Science         176   [illeg.]   [illeg.]       27.3   [illeg.]       13.1

Average               (7297)       11.5       18.9       30.6       27.0       12.0


Variations in teachers’ markings in a large city high school. 
Hendrickson1 has given us another example of the differing 
standards in marking pupils in the several departments of a 
city high school. His tabulation follows: 

TABLE 4 


Distribution of Marks by Departments, Van Nuys, June, 1927 


                                   A          B          C          D          E   Total No.
Department                         %          %          %          %          %    of Marks

Art                               29         32         29          8          2         302
Commercial                        21         39         33          3          4         348
English                           10         27         28         22         13         984
History                           13         27         36         16          8         710
Home Economics                    21         36         28         10          5         288
Languages                         19         32         24          7         18         323
Mathematics                        9         27         32         22         10         596
Mechanical Arts                   26         45         20          6          3         462
Music                             40         36         16          8          0         684
Physical Education                18         58         20          3          1         886
Science                           22         33         30          9          6         469

All Departments                   20         36         26         11   [illeg.]        6016

Junior High School                15         33         29   [illeg.]   [illeg.]        2966
Senior High School                26         31         26   [illeg.]          8        2153

Academic Departments              13         28         31   [illeg.]         11        3046
Non-academic Departments          26         44         22   [illeg.]          2        2970


1Carl E. Hendrickson, School Marks at Van Nuys High School, Educational Research 
Bulletin, Los Angeles City Schools, Vol. VII, No. 4 (December, 19[illegible]), pp. 8-9. 

Hendrickson’s comment on the foregoing table is both 
concise and adequate to the facts when he says: 

It is probably significant to point out that 56 per cent of the total marks 
were A or B or college recommending and only 18 per cent were D and E. 
In other words the distribution is skewed over toward the highest marks 
considerably. Here it may be of interest to add that all the mental and 
educational tests given in this school result in approximations to the normal 
curve. Also the level of intelligence here is about normal. 

The author’s study of the marks in a small high school. 
Table 5 presents a study of 659 school marks for a six-week 
period in the University of Oregon High School, Eugene, 
Oregon. P. T. refers to the grades of all practice teachers 
grouped together. The totals include regular (designated 
A, B, C, etc.) teachers and practice teachers. The school 
standards were adopted in faculty meeting as the official 
standards of the school.1 


TABLE 5 


Letter Grades Assigned by Teachers in the University High School, 
Eugene, Oregon, for a Six-Week Report Period 


                                      Percentages
Teacher                    A          B          C          D          E

A                       48.7   [illeg.]        6.6        0.0        0.0
B                       28.6       58.6       12.8        0.0        0.0
C                       15.2       66.3       15.7        1.7        1.1
D                       41.8       48.6        6.7        2.9        0.0
E                        8.9       78.6       12.5        0.0        0.0
F                       30.0       55.0       15.0        0.0        0.0
P. T.                   46.8       37.0        8.4        3.2        4.5

Total                   32.0       54.0       11.0        1.6        1.4

School Standards        6.25       25.0       37.5       25.0       6.25


“The grades as a whole are badly skewed upwards despite 
the existence of a school standard defining marks thus: 


1Quoted from The Improvement of the Written Examination, pp. 47-48. 

A to represent the upper 6% of the pupils, approximately, 

B to represent the next 25% of the pupils, approximately, 

C to represent the middle 35%-40% of the pupils, approximately, 

D to represent the next 25% of the pupils, approximately, 

E to represent the lowest 6% of the pupils, approximately. 

“Teacher E illustrates an interesting situation. At the 
previous report period, each teacher’s grade distribution 
together with a graph had been posted in the faculty room. 
This teacher’s summary showed more than fifty per cent of 
A grades given. In the effort to overcome this situation, 
this teacher forced down the number of A’s to a point well 
below every other teacher in the school, apparently by the 
very simple expedient of changing the A’s to B’s, a solution 
which did not help matters greatly for the distribution taken 
in its entirety! Could one take these grade distributions 
at face value, he could not but be impressed with the truly 
wonderful efficiency of a school where more than eighty-five 
per cent of the pupils earned either A’s or B’s. Kelly1 (after 
extensive study of prevailing marking systems) has sum- 
marized his conclusions in these words: 

A given grade or mark means many widely different things to different 
teachers when they are rating pupils for promotion. As measured by the 
achievement of the several school groups in their later work this difference 
amounts in some cases to as much as the difference between a G (good) 
and F- (fair minus) in elementary schools where the basis of marking 
includes only the steps P, F, G, and E. In high schools there is enough 
difference between the standards of schools as wholes that, measured by 
the achievement of the school groups in later school work, a mark of 70 
in one school means more than a mark of 81 in another school having the 
same passing standard by points. Within the high school and within the 
college the percentage of pupils which the various instructors fail as a 
common practice extending over several years varies from 0 to 28, or more. 

Comment on the three investigations of teachers’ marks. 
The three studies presented show the same general tenden- 
cies. No significance is to be attached to variations in the 
absolute numbers of different letter grades given in the three 


1Op. cit., page 133. 

schools since the bases for distributing marks naturally 
differ as a matter of administrative policy. Within a given 
school, however, there should be a reasonable equality of 
percentages of letter-grades from one department to the next. 

First let us consider what an “A” means. The number of 
A’s which should be given is wholly a matter of definition. 
Each school must settle for itself such questions as a matter 
of administrative policy. There is an idea current that such 
letter-grade distributions are somehow derived from and 
fixed by the normal curve. The normal curve is totally 
impotent in the matter, unless we except the fact that the 
normal curve does suggest the relative proportions of letter- 
grades in contrast with the absolute numbers. To illustrate : 



               A      B      C      D      E

Case I         5     20     50     20      5
Case II       10     25     50     10      5


Case I is obviously more in harmony with the observed 
facts about the distribution of individual differences and the 
phenomena of organic variation generally. Case II violates 
these facts due to the marked skewness or lack of symmetry 
in its distribution. To this extent the normal curve is a 
rough guide. But, if the question is which case gives the 
proper percentages of A’s (or any other letter-grade), the 
normal curve helps not at all. 
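
The contrast between Case I and Case II can be put in figures by computing a coefficient of skewness for each distribution. The sketch below is illustrative only: it assumes an arbitrary coding of the letter grades (A = 4, B = 3, C = 2, D = 1, E = 0), which is not part of the original discussion, and uses the third standardized moment as the measure of skewness.

```python
def skewness(distribution):
    """Third standardized moment of a grade distribution.

    distribution maps an assumed grade-point value to the per cent of pupils.
    """
    total = sum(distribution.values())
    mean = sum(g * p for g, p in distribution.items()) / total
    variance = sum(p * (g - mean) ** 2 for g, p in distribution.items()) / total
    third_moment = sum(p * (g - mean) ** 3 for g, p in distribution.items()) / total
    return third_moment / variance ** 1.5


# Assumed coding: A = 4, B = 3, C = 2, D = 1, E = 0.
case_1 = {4: 5, 3: 20, 2: 50, 1: 20, 0: 5}
case_2 = {4: 10, 3: 25, 2: 50, 1: 10, 0: 5}

print("Case I skewness:  %.3f" % skewness(case_1))
print("Case II skewness: %.3f" % skewness(case_2))
```

Any symmetrical set of percentages gives zero skewness, whether it allots five per cent of A's or fifteen. This is why the normal curve can guide the relative proportions but cannot settle the percentage of A's.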

A uniform marking system for the entire United States 
would have far-reaching advantages, but such uniformity is 
little more than a “pious wish.” Each school will probably 
continue, for reasons of greater or less weight, to distribute 
marks according to its locally adopted scheme. The im- 
portant thing is to adopt some arbitrary standard (it must be 
arbitrary), and then adhere to it, department by department and 
teacher by teacher, with the one important qualification that 
such standards are just only in the long run. The validity of 
such marking plans rests upon large numbers. It must not 
be applied mechanically to small classes. But small classes 
become large classes when the results are pooled semester 
after semester and year after year! Considerable departures 
from the adopted distribution must be permitted to the 
teacher (provided she can prove her point by reliable ob- 
jective evidence) in marking individual classes, but such 
latitude must not be allowed to operate systematically time 
after time. It is possible, but not probable, that the Latin 
teacher might really have had but three per cent of A-pupils 
over a ten-year period, while the civics teacher had seventeen 
per cent in the same time. Such a finding is possible, but 
not typical. There are more probable explanations. 
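
The dependence of such plans upon large numbers can be shown with the standard error of a percentage. The following is a sketch under a simplifying assumption, namely that pupils reach a class by chance, so that the number of A-pupils in a class behaves like a binomial sample. The adopted percentage of ten and the class sizes are hypothetical.

```python
from math import sqrt


def standard_error(p, n):
    """Standard error of an observed proportion when the true proportion is p
    and n pupils are graded (binomial sampling assumed)."""
    return sqrt(p * (1.0 - p) / n)


adopted_share_of_a = 0.10      # hypothetical school standard: ten per cent A's
single_class = 25              # one small class
pooled_classes = 500           # the same teacher's classes pooled over years

for n in (single_class, pooled_classes):
    se = standard_error(adopted_share_of_a, n)
    print("n = %3d: chance fluctuation of about %.1f percentage points" % (n, 100 * se))
```

On these assumptions a single class of twenty-five may depart from the adopted percentage by several points through chance alone, while the pooled records of several hundred pupils should lie close to it. A persistent departure in the pooled records therefore calls for some other explanation.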

If four conditions were approximated closely, grades might 
be assigned by the mechanical application of some adopted 
percentage-letter-grade plan. These conditions are: 

1. That very large (theoretically, infinite) numbers of 
pupils are to be graded. 

2. That there is no selective enrollment in different 
classes, sections, etc. (This implies chance assignment to 
classes, absence of ability-grouping, and non-existence of 
selection upon a basis of ability in electing programs of 
study.) 

3. That teachers are equally efficient in their instruction. 

4. That marks are based upon wholly valid and reliable 
measurements. 

The problem is to decide whether such wide divergencies 
as were shown in the studies of Johnson, Hendrickson, and 
Ruch are fully explained by the factors of small populations, 
selective enrollment, differences in efficiency of instruction, 
etc., or whether it is more reasonable to suppose that much 
of the departmental variation is to be explained by such facts 
as non-adherence to the accepted school standards, 
differences in the subjective standards of different teachers, etc. 
The reader can adjudge the relative merits of these rather 
intricate differences in points of view. When, as the author 
found in a study not reported in detail here, one department 
in a large university gave fifteen per cent of A’s and another 
gave .9 of one per cent of A’s, a problem exists. In this 
particular case the former department gave as its alibi the 
statement that superior students were attracted to its 
courses. The intelligence test records of the university were 
consulted. The average of the students of this department 
was well below the average of the university! Taking the 
average marks and the average intelligence ratings by 
departments, only the slightest correlation was found. This 
leads us to suspect that variations in the abilities of pupils 
from one department to the next are probably much too 
small to explain the variations in marks. 

We have a certain type of teacher who has “high 
standards.” She prides herself on having her “sights” high. In 
extreme cases she boasts that she has never given an A. 
A correspondent once wrote the author about a college 
student who grew tired of receiving a constant string of 
C’s. He put the teacher to an (underhanded and 
ungentlemanly?) test by copying word for word an exquisite bit of 
literature by a prominent author. This he handed in as his 
own work. He received a C! Such a teacher may have had 
high standards; who can say? Again, it may merely have 
been chronic indigestion or ignorance. 

Adherence to an adopted grading system is nothing more 
than “playing the game fairly.” No one holds such grading 
plans to be more than definitive. Uniformity of practices 
has its obvious advantages. Departures from uniformity 
must be left open, with the qualification that the burden of 
proof is on the one who departs. The essential thing is 
assurance that the underlying evaluations are valid and 
reliable; that the final grades are based upon what has been 
termed a defensible rank-order of abilities. If marks are 
relatively fair, the exact final distribution reduces to a pure 
matter of definition of school policy. Tests and examinations, 
if well-constructed, form a good basis for both adherence to 
and departure from the defined standards. Personal judg- 
ments also have their values, but tests can be made to 
discriminate finer differences than can unaided subjective 
estimates. 

INVESTIGATIONS OF REGRADINGS OF THE SAME PAPERS 

The studies of Starch and Elliott. The pioneer work in 
this field was done by Starch and Elliott,1 who submitted 
exact copies of the same examination papers to a large num- 
ber of teachers. Figure 5 shows the marks of 142 English 
teachers on the same examination in English; Figure 6 
shows the variations observed when 115 teachers graded the 
same paper in geometry. 

Fig. 5. The marks assigned to the same English paper by 142 teachers 
of English (after Starch). 

Fig. 6. The marks assigned to the same paper in geometry by 115 
teachers of high-school mathematics (after Starch). 


1School Review, Vol. XX: pp. 442-57; Vol. XXI: pp. 254-59; Vol. XXVI: pp. 676-81. 




It had previously been supposed that marks in 
mathematics could be much more objectively given than for such 
subjects as English and history. Rather strangely it turned 
out that the English papers were graded with somewhat 
greater uniformity than those in history and mathematics 
in this investigation. Later, Starch found that teachers 
within the same school show almost as great differences in 
their markings as do teachers selected from different schools. 

Ruch’s repetition of the studies of Starch and Elliott.1 

“The first experiment was that of submitting three answers 
to the same question in geography to about a hundred teach- 
ers for regrading. All three answers were taken from the 
papers of one class, those papers selected being the best, 
poorest, and median papers. The pupils’ replies were 
copied verbatim and mimeographed, retaining all errors. 
The instructions to the teachers and the answers are re- 
produced below. 

Directions: Below are three actual answers to the question: Name 
and locate five of the largest cities of the United States and name their leading 
industries, exports, and imports. 

Grade each of the three answers on a scale of 0 to 20, according to your 
best judgment of its merit, 20 being an answer ordinarily accepted by 
teachers as entirely satisfactory, and 0 being an answer practically without 
discernible merit. 


Answer 1 

Five of the largest cities in the United States is Detroite. An export is 
Cars. And industry is Manufactoring. Chicago is an important city and 
an export is Manufactored and canned goods. An industry of Chicago is 
meate packing. New York is another important city. An industry of 
N. Y. is manufactoring. An export of N. Y. is manufactored goods. 
Pittsburge is an important city of U. S. An export is iron ore. A industry 
of Pittsburge is manufactoring. Another important city of U. S. is New 
Orleans. An export of New Orleans is cotton. An industry of New Or- 
leans is manufactoring. 

Grade 


1Quoted with slight changes from The Improvement of the Written Examination, pp. 55-60. 

Answer 2 

The five largest cities of the United States are (1) navada (2) Arkansas. 
The leading industry of navada is manufacture and the leading industry 
of Arkansas is agriculture. The leading imports are manufacturing mostly. 

Grade 

Answer 3 

The 5 largest cities of United States are New York, Chicago, St. Louis 
Boston, San Francisco, New York is in the State of New York. Chicago 
is in the State of Illinois. St. Louis in in the state of Missourri. Bostin 
is in the state of Mass. San Fransisco is in the state of California. New 
York is a mnufacturing city. Chicago is noted for meat packing center. 
St. Louis is noted for manufacturing textile goods and iron goods. Boston 
is noted for manufacturing of textile and iron goods. San Fransisco is 
noted for the packing of fruit New York exports iron goods and imports 
wool, cotton, and other raw materials, Chicago exports meat and hides 
and grain and imports food and grains. St. Louis exports manufactured 
products and imports raw materials. 

Grade 

“Table 6 on the next page summarizes the regradings of 
the three answers by ninety-one teachers. The facts illus- 
trate several points previously brought out. In the first 
place it is to be noted that the original grading in no case 
agrees at all closely with the average grades of the ninety-one 
teachers. The difference is from two to five points in each 
case. The ninety-one teachers show a wide variance of 
opinion about the merits of these three answers. Which set 
of marks is correct, the originals or the regradings? 

“The only answer is to accept the average of the group as 
approximating the truth. Granting this for the moment, 
we find that the original mark of Answer 1 was about five 
points too high, and that for Answer 3 was about four points 
too low. What reason can be assigned for this situation? 
So far as the facts presented in Table 6 go, it is impossible to 
answer this question satisfactorily. A probable explanation 
is to be found in the fact that Answer 1 was taken from a 
rather superior paper, but Answer 3 was found in a decidedly 
inferior paper, the papers taken as a whole. The original 
grader has undoubtedly been unconsciously influenced by 
these facts; thus tending to grade leniently a poor answer in 
an otherwise very good paper, and underestimating the value 

TABLE 6 


Original Marks and Marks Assigned by 91 Teachers Who Regraded 
Three Answers to a Question in Geography 


Mark                    Answer 1    Answer 2    Answer 3

20                             1                       9
19                             0                       3
18                             1                      21
17                             1                      12
16                             2                      17
15                            10                      15
14                             2                       3
13                             8                       1
12                            14                       3
11                             5                       1
10                            24                       3
 9                             5                       1
 8                             7                       1
 7                             3                       0
 6                             4                       0
 5                             1                       0
 4                             0                       0
 3                             2                       1
 2                             1           3           0
 1                             0           1           0
 0                             0          87           0

Number                        91          91          91
Mean                        10.9         0.1        16.1
Standard Deviation           3.2         0.4         2.9
Original Mark                 16           2          12


of a good answer which formed a part of an inferior paper. 
This same fact was evident many times in these studies of 
examination papers, and if space permitted, considerable 
statistical proof of the operation of such unconscious biases 
could be presented. Similarly, the existence of systematic 
tendencies for certain teachers to grade too leniently and 
for others to be too harsh in their judgments might be amply 
demonstrated by facts at hand. These phenomena are too 
well recognized by teachers, however, to require proof or 
comment. 
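
The means and standard deviations reported in Table 6 can be recomputed from the frequencies themselves. The sketch below does so in Python. The frequencies are transcribed from Table 6, with the blank cells for Answer 2 read as zero (an assumption), and the standard deviation is taken about the mean with N in the denominator, which may or may not be the exact formula used in the original study.

```python
from math import sqrt

MARKS = list(range(20, -1, -1))   # marks 20, 19, ..., 0

# Number of teachers assigning each mark (transcribed from Table 6).
ANSWER_1 = [1, 0, 1, 1, 2, 10, 2, 8, 14, 5, 24, 5, 7, 3, 4, 1, 0, 2, 1, 0, 0]
ANSWER_2 = [0] * 18 + [3, 1, 87]  # blank cells assumed to mean zero
ANSWER_3 = [9, 3, 21, 12, 17, 15, 3, 1, 3, 1, 3, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]

ORIGINAL_MARKS = {"Answer 1": 16, "Answer 2": 2, "Answer 3": 12}


def mean_and_sd(marks, freqs):
    """Return (N, mean, standard deviation) for a frequency distribution."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(marks, freqs)) / n
    variance = sum(f * (m - mean) ** 2 for m, f in zip(marks, freqs)) / n
    return n, mean, sqrt(variance)


for label, freqs in (("Answer 1", ANSWER_1),
                     ("Answer 2", ANSWER_2),
                     ("Answer 3", ANSWER_3)):
    n, mean, sd = mean_and_sd(MARKS, freqs)
    difference = ORIGINAL_MARKS[label] - mean
    print("%s: N = %d, mean = %.1f, S.D. = %.1f, original minus mean = %+.1f"
          % (label, n, mean, sd, difference))
```

The last figure printed for each answer is the amount by which the original mark exceeds, or falls short of, the average judgment of the ninety-one teachers.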

“In a second experiment, three entire history papers taken 
from a seventh-grade class in American history were mimeo- 
graphed so as to preserve all spelling and grammatical errors, 
and as much of the mechanical form as was possible in the 
duplication. The papers were again the best, median, and 
poorest papers of the class. These were then regraded by 
115 teachers, independently. The original marks were, of 
course, not known to the group. Space does not permit the 
recording of either the questions or the copies submitted to 
the teachers. Table 7 on the next page summarizes the facts 
of this experiment, which was similar in nature to those 
formerly reported by Starch and Elliott. 

“Table 7 would seem to show that the original grader of 
these papers was too tolerant in her standards, the composite 
judgments lowering all of the original marks by at least ten 
points. This illustrates the operation of systematic biases 
which have been asserted as a source of error. Marked 
variability exists as before in the judgments of the 115 
teachers as to the worth of these papers. In this case, how- 
ever, the ranks of the papers remain the same as before, so 
that there is little doubt that distinguishable differences in 
merit are present among the three examinations. It should 
be recalled that these three papers are as widely spaced 
among the class as was possible, the best, the median, and 
the poorest paper being selected. Attention should also be 
called to the amount of overlapping present, a fact that is 
more graphically shown in Figure 7 on page 83. Only the 
average of a very large number of teachers’ markings would 
demonstrate decisively that real differences in merit are 
present in these papers. 

TABLE 7 


Original Marks and Marks Assigned by 115 Teachers Who Regraded 
Three Papers from a Class in American History 


Mark                     Paper 1     Paper 2     Paper 3

100                            6
 95                           33
 90                           32           1
 85                           22          12           1
 80                           15          13           4
 75                            6          29           9
 70                            1          18           5
 65                                       16          20
 60                                       12          16
 55                                        9          19
 50                                        3          13
 45                                        2           8
 40                                                   14
 35                                                    5
 30                                                    0
 25                                                    1

Number                       115         115         115
Mean                        88.7        70.3        56.6
Standard Deviation           6.6         9.9        12.3
Original Mark                100          88          67
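
The overlapping of the three distributions in Table 7 can be expressed as a single figure: the chance that one teacher, drawn at random, marks the poorer paper higher than another teacher, drawn independently, marks the better paper. This measure is offered only as an illustration and is not used in the original study. The frequencies are transcribed from Table 7, and the doubtful entry for Paper 2 at the mark of 90 is read as 1 so that the column totals 115.

```python
# Marks (in steps of five) and the number of teachers assigning each.
PAPER_1 = {100: 6, 95: 33, 90: 32, 85: 22, 80: 15, 75: 6, 70: 1}
PAPER_2 = {90: 1, 85: 12, 80: 13, 75: 29, 70: 18,       # entry at 90 assumed
           65: 16, 60: 12, 55: 9, 50: 3, 45: 2}
PAPER_3 = {85: 1, 80: 4, 75: 9, 70: 5, 65: 20, 60: 16, 55: 19,
           50: 13, 45: 8, 40: 14, 35: 5, 30: 0, 25: 1}


def chance_of_reversal(better, poorer):
    """Probability that a randomly chosen mark on the poorer paper is strictly
    higher than an independently chosen mark on the better paper."""
    reversed_pairs = sum(fb * fp
                         for mark_b, fb in better.items()
                         for mark_p, fp in poorer.items()
                         if mark_p > mark_b)
    all_pairs = sum(better.values()) * sum(poorer.values())
    return reversed_pairs / all_pairs


print("Paper 2 marked above Paper 1: %.3f" % chance_of_reversal(PAPER_1, PAPER_2))
print("Paper 3 marked above Paper 2: %.3f" % chance_of_reversal(PAPER_2, PAPER_3))
print("Paper 3 marked above Paper 1: %.3f" % chance_of_reversal(PAPER_1, PAPER_3))
```

A value of zero would mean that no pair of teachers could reverse the order of two papers. Values well above zero show how far the judgment of merit depends upon which teacher happens to read the paper.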


Wood’s studies of the examinations of the College 
Entrance Examination Board. The following paragraphs are 
quoted directly from Wood.1 

“More recent evidence of the inadequacies of the marks 
derived by the traditional examinations was brought out 
in a study of the algebra and geometry examinations of 
the College Entrance Examination Board for June, 1921. 
About four hundred algebra and an equal number of geome- 
try papers selected at random were scored each twice inde- 
pendently by two different readers of the Board. The 
correlations between the first and second scorings were 
very high for both algebra and geometry, about .98 and .96. 


1B. D. Wood, Measurement in Higher Education (Yonkers-on-Hudson, New York: 
World Book Company, 1923), pp. 124-125. 












[Fig. 7. Graph not reproduced. Legend: Paper 1 (best), Paper 2 (median), ... Surviving caption fragment: “... history papers in Table 7.”] 

In spite of this almost perfect objectivity in the scoring, 
however, the reliabilities of the examinations themselves 
were very low. The correlation between random halves 
of the algebra examination was found to be only .61, and 
that between random halves of the geometry examination 
only .41. By the use of Brown’s formula, the reliability of 
the whole algebra examination is estimated as not greater 
than .76, and that of the geometry examination as not 
greater than .58. 
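
The estimates of .76 and .58 can be checked by direct computation. The sketch below assumes that “Brown’s formula” is the formula now commonly called the Spearman-Brown formula, which estimates the reliability of a test k times as long as the part whose reliability is known. The function name `spearman_brown` is illustrative.

```python
def spearman_brown(r_part, k=2.0):
    """Estimated reliability of a test k times as long as a part
    whose reliability (e.g., the correlation between halves) is r_part."""
    return k * r_part / (1.0 + (k - 1.0) * r_part)


half_correlations = {"algebra": 0.61, "geometry": 0.41}

for subject, r_half in half_correlations.items():
    r_whole = spearman_brown(r_half)
    print("%s: halves correlate %.2f, whole examination about %.2f"
          % (subject, r_half, r_whole))
```

With k = 2 the formula reduces to 2r / (1 + r), the case of a whole examination estimated from the correlation between its two halves.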

“The meaning of these reliability coefficients may be made 
clearer by a consideration of a hypothetical case closely 
resembling the actual situation faced by the Board in giving 
college entrance examinations. 

“Let us suppose that ten thousand candidates are tested 
with Form A of a given geometry examination whose re- 
liability is about .60, and that thirty per cent of the ten 
thousand fail. Now let us suppose that the same ten 
thousand are tested with another equivalent geometry 
examination, say Form B, whose reliability is also .60, and 
which fails thirty per cent of the candidates. 

“If the reliability of the two forms of the examination 
were 1.00, the same three thousand candidates would be 
failed by both forms; but with a reliability of only .60, the 
agreement on failures would be as follows in gross numbers: 


Failed Form A,             Passed on 
Passed Form B              both Forms 
    1279                      5721 

Failed on                  Passed Form A, 
both Forms                 Failed Form B 
    1721                      1279 


“In other words, accepting the results of one such exami- 
nation as valid, which was done by the Board in 1921, 
another equivalent examination would pass 1279 of the 
3000 failed by the first, and would fail 1279 of the 7000 
passed on into college by the first. 

“If we assume that fifty per cent were failed by each of 
the two forms, the displacements in gross numbers would be: 


    1015                      3985 

    3985                      1015 


The fate of two thousand in ten thousand candidates would 
be reversed by an equivalent form of the same examination 
when the reliability is no higher than that of the C.E.E.B. 
Mathematics C Examination for June, 1921.” 
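
Agreement of this kind between two forms can be estimated under a definite statistical model. The sketch below assumes that scores on the two forms follow a bivariate normal distribution whose correlation equals the reliability, and that both forms fail the same proportion of candidates. Wood does not state how his figures were computed, so this model need not reproduce his tables exactly. The function names are illustrative.

```python
from math import asin, exp, pi, sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def both_fail_probability(fail_rate, r, steps=4000):
    """P(candidate fails both forms) for bivariate-normal scores with
    correlation r, by Simpson's rule over the first form's score."""
    cut = STANDARD_NORMAL.inv_cdf(fail_rate)   # passing mark in standard units
    lower = -8.0                               # stands in for minus infinity
    h = (cut - lower) / steps                  # steps must be even

    def integrand(x):
        density = exp(-0.5 * x * x) / sqrt(2.0 * pi)
        # Chance of failing Form B given a standard score x on Form A.
        conditional = STANDARD_NORMAL.cdf((cut - r * x) / sqrt(1.0 - r * r))
        return density * conditional

    total = integrand(lower) + integrand(cut)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lower + i * h)
    return total * h / 3.0


def both_fail_at_median(r):
    """Closed form for the special case of a fifty per cent failure rate."""
    return 0.25 + asin(r) / (2.0 * pi)


candidates = 10000
p_both = both_fail_probability(0.30, 0.60)
failed_both = round(candidates * p_both)
failed_one_only = round(candidates * 0.30) - failed_both
print("Failed on both forms:", failed_both)
print("Failed Form A but passed Form B:", failed_one_only)
```

The closed form for a fifty per cent failure rate is included only as a check upon the numerical integration. For the thirty per cent case the model should give a count of double failures close to, though not necessarily identical with, the figure in Wood’s first table.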

Another view. Recently Bolton has boldly denied that 
teachers show marked lack of uniformity in marking papers.1 
His evidence is based upon an investigation which he con- 
ducted together with a re-examination of a minor study of 
Starch (not the larger ones previously reported in this 
volume). Bolton’s experiment was very well conceived 
except in one or two respects noted below. His statistical 
arguments are not nearly so fortunate. 

Bolton had a number of sixth-grade arithmetic teachers 
make examinations and administer them. From the papers 
he selected by a sampling process the results for twenty-four 
pupils. These papers were then graded by twenty-two 
teachers. This procedure is admittedly sound save in the 
very important respect that Bolton selected what is probably 
the second most highly objective school subject (arithmetic) for 
his investigation. Starch, it will be recalled, chose highly 
subjective school subjects in the main. So did most of 
those who repeated the work of Starch. Bolton’s point of 
view which guided his set-up for the investigation can best 
be expressed in his own words. Speaking of the type of 
teachers used by previous investigators, he says (p. 24): 

They vary in experience; their everyday work may vary from teaching 
beginners to read to administering a school system with a hundred teachers; 

¹F. E. Bolton, “Do Teachers’ Marks Vary as Much as Is Supposed?” Education, Vol. 
XLVIII (1927), pp. 23-38. 
some teach one subject, some many others; some have had real professional 
training, some absolutely none. Possibly not one tenth [italics mine] of 
those marking the papers have had experience in marking papers in that 
subject, and many are so rusty in the facts of that subject as not to know 
the answers to the questions themselves. 

While it is possible that Bolton’s statements may be true 
in some cases, the assertion that nine-tenths of the teachers 
taking part in the experiments of Starch (and the present 
author) were incompetent is gross misrepresentation. Starch 
used 142 English and 115 geometry teachers from the 
North Central Association selected under instructions to 
have the grading done by the “principal teacher” of the 
subject. The total of 206 teachers used by the author in 
grading the geography and history replies of Tables 6 and 7 
were all selected upon the basis of experience and training, 
and all had more than average professional training. 

The main objection to Bolton’s procedure and his conclu- 
sions rests in his choice of statistical methods. He averaged 
the marks of the twenty-two teachers for each of the twenty- 
four papers, obtaining the values in column (e) of Table 8. 
He next found the average of the deviations about such 
averages, column (f). The other columns are the selections 
of the present author from Bolton’s Table I. 

There is of course no objection to the use of averages and 
average deviations from such averages as a statistical pro- 
cedure unless we question the choice of such a method of 
interpretation. After all, the range between the highest 
and lowest marks given an individual paper may be the 
important thing, and not the fact that the average deviations 
about the averages of twenty-two teachers is fairly small. 

Now it must be admitted that Pupil 12 was fairly marked, 
for all practical purposes, by any one of the twenty-two 
teachers. In fact, he is the only one in the whole lot about 
whom such an assertion may be made unqualifiedly. But 
how about Pupils 5, 6, 13, 17, 19, 21, and 23, not to mention 
TABLE 8 

Selected Data From Bolton’s Study of Teachers’ Marks 

  (a)       (b)         (c)        (d)        (e)           (f)
 Pupil    Lowest      Highest     Range    Average of    Average of
           Mark        Mark                22 Teachers   Deviations
         Assigned    Assigned

   1       77.5        100        22.5       88.7           3.6
   2       75           90        15         85.0           4.3
   3       70           95        25         88.7           3.5
   4       43           71        28         57.8           5.9
   5       65           95        30         84.6           6.4
   6       25           70        45         51.0          10.5
   7       74           91        17         84.8           3.1
   8       73           93        20         84.7           5.4
   9       85          100        15         93.8           2.7
  10       71           95        24         89.6           3.4
  11       66           80        14         77.5           2.8
  12       84           90         6         88.0           1.4
  13       46           85        39         71.5           7.8
  14       67           84        17         74.4           2.3
  15       47           74        27         62.7           4.6
  16       82           98        16         91.5           3.9
  17       53           95        42         78.9           8.4
  18       70           91        21         77.9           4.4
  19       37.5         78        40.5       53.5           7.2
  20       43           71        28         55.3           4.0
  21       48           88        40         74.5           8.5
  22       35           58        23         45.1           4.7
  23       43           75        32         65.3           7.3
  24       17.5         47        29.5       27.2           5.8

Medians    65.5         89        24.5       77.7           4.5

certain others? If Pupil 6 was graded all the way from 25 
to 70 with an average of 51, just what are we to conclude 
about his ability? The median of the ranges is roughly 
twenty-five points. In about half the cases the most lenient 
teacher marked from twenty-five to forty-five points higher 
than the most severe one. 
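
The reader who wishes to verify these statements about the ranges may do so directly from columns (b) and (c) of Table 8. The following is a minimal sketch; the variable names are chosen here for illustration, and the threshold of twenty-five points is simply the figure used in the text.

```python
from statistics import median

# (pupil, lowest mark, highest mark) copied from columns (a)-(c) of Table 8.
TABLE_8 = [
    (1, 77.5, 100), (2, 75, 90), (3, 70, 95), (4, 43, 71),
    (5, 65, 95), (6, 25, 70), (7, 74, 91), (8, 73, 93),
    (9, 85, 100), (10, 71, 95), (11, 66, 80), (12, 84, 90),
    (13, 46, 85), (14, 67, 84), (15, 47, 74), (16, 82, 98),
    (17, 53, 95), (18, 70, 91), (19, 37.5, 78), (20, 43, 71),
    (21, 48, 88), (22, 35, 58), (23, 43, 75), (24, 17.5, 47),
]

# Range = highest mark minus lowest mark given to the same paper.
ranges = {pupil: high - low for pupil, low, high in TABLE_8}

median_range = median(ranges.values())
# Pupils whose most lenient and most severe markers differed by 25 points or more.
wide = sorted(pupil for pupil, spread in ranges.items() if spread >= 25)

print("median range:", median_range)
print("pupils with a range of 25 or more:", wide)
```

The computation treats each paper separately, which is the point at issue: the average deviation summarizes the twenty-two markers as a group, while the range shows what the two extreme markers did to one pupil.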

Do such data support the statement, “A glance at the 
distribution ... of variations from the average discredits 
entirely [italics mine] the assertion that there is no uni- 
formity of marks” (p. 28)? 
Starch, for much more subjective subjects than arithmetic, 
reported ranges among the marks of more than one hundred 
teachers as follows: 


 Subject        Range                     No. of Teachers

 English        50 to 98 (48 points)           142
 Geometry       28 to 92 (64 points)           115



The author found results for three history papers as 
follows: 


 Paper          Range                     No. of Teachers

   1            70 to 100 (30 points)          115
   2            45 to 90 (45 points)           115
   3            25 to 85 (60 points)           115



Taking all facts into consideration, the following state- 
ments are given with the hope that the reader will evaluate 
each and arrive at some decision as to the validity of Bolton’s 
refutation of Starch and others: 

1. Bolton dealt with a reasonably objective school sub- 
ject; spelling being, perhaps, the only less subjective ele- 
mentary school ability. 

2. He used a much smaller number of teachers (22), thus 
possibly reducing the variability considerably. 

3. He shifted the argument from the idea of extreme differ- 
ences (ranges) to deviations about an average. This is 
defensible, of course, speaking purely statistically, but the 
fact remains that his interpretation is not comparable to 
that of Starch. When comparable treatments are made, the 
differences between Bolton and Starch are not so very great. 
This leaves us squarely with the question whether the 
important thing is what the average teacher does or what 
the extremes of individual teachers do. Take Pupil 10 of 
Table 8, who represents about the average situation found 
by Bolton. One teacher gave him 95, another 71. The 
former grade would probably be an A and the latter a D or 
an E on a five-point scale. The blunt fact remains that it 
makes a lot of difference to that pupil which school he 
attended and what teacher he “drew.” With the admitted 
exception of Pupil 12, and possibly (to be generous) Pupils 
2, 7, 9, 11, 14, and 16, the other seventeen have a right to 
“kick” about the situation. And, Pupils 5, 6, 13, 17, 19, 21, 
and 23 have the moral right to riot and insurrection. They 
are failures or successes depending upon their teachers, 
regardless of averages and average deviations. 

The reader must judge for himself whether or not Bolton 
has made his point. 

STUDIES ON THE RELIABILITY COEFFICIENTS OF 
EXAMINATIONS 

Introduction. Reliability has already been discussed and 
defined in several ways (Chapter II). The quantitative 
expression of the degree of reliability is usually made in 
terms of coefficients of correlation. Correlation is, as the 
term itself suggests, co-relation or the degree of correspondence 
between two series of numerical values. The mathematics 
of the computation of coefficients of correlation will 
be reserved for Chapter XV, which discusses the elementary 
statistical methods related to examination practices. For 
the present it will serve our purposes to know the general 
significance of the “reliability coefficient.” 

When two sets of measures of the same ability or function are 
correlated, we term the resulting coefficient of correlation a 
reliability coefficient. By “two sets of measures of the same 
ability or function” we have in mind equivalent or comparable 
forms of the same test, or some closely analogous pair 
of measures. 
If a teacher gives two forms of a standard test or if she 
administers two duplicate examinations, the two sets of 
scores may be compared by correlation, the resulting coeffi- 
cient in this case being a reliability coefficient. 
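
Although the mathematics is reserved for Chapter XV, a brief illustration may fix the idea. The sketch below assumes that the coefficient in question is the ordinary product-moment (Pearson) coefficient; the two lists of scores are invented for illustration and come from no study cited here.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of the same eight pupils on two forms of a test.
form_a = [72, 65, 88, 54, 91, 60, 77, 83]
form_b = [70, 61, 85, 59, 94, 58, 71, 80]

reliability = pearson_r(form_a, form_b)
print(f"reliability coefficient: {reliability:.2f}")
```

When the same pupils stand in nearly the same order on both forms, the coefficient approaches 1.00; when the two forms rank them without any agreement, it approaches zero.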

There are several ways of obtaining reliability coefficients 
when we are studying examinations: 

1. Two equivalent (or roughly equivalent) tests may be 
given and the results correlated. 

2. A single test may be given, the papers graded inde- 
pendently by two teachers, and the two sets of marks are 
then correlated. 

3. A single examination, graded by a single person, may 
be broken into two half-examinations by some chance 
method (e. g., taking alternate items in each half-form), 
and the halves are then correlated. (This gives directly the 
reliability of half the examination. The reliability of the 
whole examination can be estimated rather accurately by 
the use of the appropriate mathematical formula.)¹ 

These three methods are not exactly comparable in mean- 
ing, but each has its distinct uses in attacking the question 
of the reliability of examinations. 
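
The third method may be sketched as follows. The splitting by alternate items follows the description above; the step-up from the half-examination to the whole is here assumed to be the usual Spearman-Brown formula, r_whole = 2r / (1 + r), to which the footnote refers. The function names, the item scores, and the half-test correlation of 0.60 are hypothetical.

```python
def split_alternate_items(item_scores):
    """Split one pupil's item scores into two half-examination totals.

    Items 1, 3, 5, ... go to the first half; items 2, 4, 6, ... to the second.
    """
    first_half = sum(item_scores[0::2])
    second_half = sum(item_scores[1::2])
    return first_half, second_half


def spearman_brown(r_half, lengthening=2):
    """Estimate the reliability of a test `lengthening` times as long.

    With lengthening=2 this is the familiar 2r / (1 + r) step-up from a
    half-examination to the whole examination.
    """
    if not -1.0 < r_half <= 1.0:
        raise ValueError("r_half must lie in (-1, 1]")
    return lengthening * r_half / (1 + (lengthening - 1) * r_half)


# One pupil's marks on ten items (hypothetical), split into halves.
print(split_alternate_items([1, 0, 1, 1, 0, 1, 1, 1, 0, 1]))

# A hypothetical correlation of 0.60 between the two halves.
print(round(spearman_brown(0.60), 2))
```

The half totals of all pupils would first be correlated with each other; only that half-test coefficient is then stepped up by the formula.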

“High” and “low” reliability. When is a correlation 
“high” or “low”? There are as many answers as there are 
textbooks on statistics and measurement. The question is 
far too intricate for full discussion. Instead, we shall beg 
the issue by laying down dogmatic statements which will 
define the author’s point of view for purposes of present 
interpretations. 

Correlations of 0.00-0.25 are insignificant. 

Correlations of 0.25-0.50 are low. 

Correlations of 0.50-0.80 are fairly significant. 

Correlations of 0.80-0.95 are fairly high. 

Correlations of 0.95-1.00 are high. 

¹See Chapter XV for a discussion of the use of the Spearman-Brown formula in this 
connection. 





TABLE 9 

Summary Distribution of Coefficients of Reliability 
for Written Examinations 

 Size of Coefficient
   of Correlation         Frequency

        .95                   1
        .90                   2
        .85                   4
        .80                   4
        .75                   9
        .70                   4
        .65                   9
        .60                   8
        .55                   4
        .50                   4
        .45                   5
        .40                   2
        .35                   1
        .30                   4
        .25                   1
        .20                   0
        .15                   1
        .10                   0
        .05                   1
        .00                   0
       -.05                   0
       -.10                   0
       -.15                   1
       -.20                   1

       Total                 66
       Median                .65


These statements are something of a compromise between 
strict statistical considerations and the more practical 
question of the frequency with which tests and measurements 
attain these several levels of magnitude. In general, the 
present interpretation is more conservative than that found 
in most textbooks. 
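
The five bands laid down earlier in this section may be written out as a small rule. Two points in the sketch are assumptions made here and not the author's: a coefficient falling exactly on a boundary (0.25, 0.50, 0.80, 0.95) is placed in the higher band, and a negative coefficient is classed by its size without regard to sign.

```python
def interpret_correlation(r):
    """Label a coefficient with the verbal bands used in this chapter.

    Assumptions (not stated in the text): boundary values go to the
    higher band, and the sign of the coefficient is ignored.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    size = abs(r)
    if size < 0.25:
        return "insignificant"
    if size < 0.50:
        return "low"
    if size < 0.80:
        return "fairly significant"
    if size < 0.95:
        return "fairly high"
    return "high"


for value in (0.65, 0.905, 0.38, -0.20):
    print(value, interpret_correlation(value))
```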

Monroe and Souders’s study of the reliability of written 
examinations (traditional type). Monroe and Souders computed 
reliability coefficients for sixty-six examinations. The same 
pupils were given two examinations; “in most cases the 
questions were prepared and the papers marked by different 
teachers.” Table 9 is reproduced from Monroe.¹ The range of 
reliability coefficients was from -0.20 to 0.95, the median 
being 0.65.² 


McGregor and Ruch’s study of state eighth-grade 
examinations.³ “Requests were sent to all state superintendents 
of public instruction for copies of all official state 
eighth-grade examinations for as many years past as possible. 


¹W. S. Monroe, J. C. DeVoss, and F. J. Kelly, Educational Tests and Measurements 
(Revised edition, Boston: Houghton Mifflin Co., 1924), p. 471. 

²Negative correlations are interpreted in exactly the same way as positive ones except 
that the relationships show tendencies to be inverse; i. e., the pupils doing well on one 
examination tended to do poorly on the other; at least, the small amount of correlation 
noted in such negative situations was of this inverse sort. 

³Quoted with minor changes from Ruch, et al., Objective Examination Methods in the 
Social Studies (Chicago: Scott, Foresman and Co., 1926), pp. 8-12. 
Eleven states responded with questions which were actually 
used in this investigation. A few other states delayed their 
returns until too late for inclusion. 

“The examination questions from the eleven states were 
classified in three groups: United States history, geography, 
and civics (citizenship). Key numbers were assigned to the 
examinations in order that the source of the questions would 
not be revealed to the schools co-operating in the experiment. 
These key numbers are used in the tabulations to be presented 
in this chapter. Every attempt was made to avoid 
any publicity about the particular states furnishing the 
questions, since it was the intention of the investigators to 
study the eighth-grade examination system as a whole 
rather than to direct attention to the examination practices 
of particular states. Parenthetically, it may be said that no 
evidence was found which suggested that any of the individ- 
ual states were measurably superior or inferior to the others 
in the character of the examinations set for their eighth- 
grade pupils. 

“Occasional questions were omitted when such questions 
were based upon local history, geography, or government, 
for two reasons: (1) in order not to reveal the source of 
the examination, and (2) because the inclusion of such 
questions would not be a valid procedure in view of the 
fact that the questions would be used in states other than 
the one for which the examination was devised. Since the 
examinations usually offered some degree of choice in the 
questions to be answered, it was possible to omit an oc- 
casional question without much violence to the examination. 

“The sets of questions were then mimeographed so that 
each pupil might have his individual copy. The following 
directions were given to the pupils: 

You are to be given an examination in __________ (the 
teacher supplied the subject), which pupils in another state had to take 
in order to get their eighth-grade diplomas. We want to see if you can do 
as well as pupils in other states. Work as fast as you can without making 
mistakes. When you have finished, record your time in a square which 
you should make on the last page near the bottom.” 

“The teacher timed the examinations to the nearest one- 
half minute by means of the plan of writing the elapsed 
time at half-minute intervals on the blackboard. All of the 
pupils used in the experiment wrote on two examinations 
for the same subject, viz., the set of questions for the year 
1923 and the set for 1924. Thirty-two experienced teachers 
of the social studies did the scoring, every paper being marked 
independently by two teachers. 

“That the investigation included a wide sampling of state 
examinations, pupils, and scorers is shown by the following 
facts: 

(1) The eighth-grade examinations were drawn from 
eleven different states. 

(2) Thirty-two different sets of questions were used, i. e., 
both the 1923 and 1924 questions for sixteen school subjects. 

(3) Thirty-two different teachers read the papers, each 
teacher reading the 1923 and 1924 examinations for one 
class of pupils. 

(4) The papers include two examinations each from 952 
pupils representing 15 schools and 11 states. 

“All papers were graded upon a basis of 100%. If the 
examination included ten questions, each question was allowed 
a maximum of ten points. Where five, eight, etc., 
questions were employed, the 100 points were divided evenly 
among the questions. 

“Treatment of the results. The sixteen examinations per- 
mitted the calculation of a total number of ninety-six correla- 
tions, each correlation being a reliability coefficient from some 
point of view. The six correlations possible for each set of 
examinations may be shown by the following outline: 

(1) 1923 examination: scorer No. 1 vs. scorer No. 2. 

(2) 1924 examination: scorer No. 1 vs. scorer No. 2. 

(3) Scorer No. 1: 1923 examination vs. 1924 examination. 

(4) Scorer No. 2: 1923 examination vs. 1924 examination. 

(5) 1923 examination scored by No. 1 vs. 1924 examination scored by 
No. 2. 

(6) 1924 examination scored by No. 1 vs. 1923 examination scored by 
No. 2. 
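
That six correlations, and no more, are possible for each set of examinations follows from the fact that each pupil has four marks (two examinations, each read by two scorers), and four things can be paired in six ways. A minimal sketch of the count; the labels are written here for illustration:

```python
from itertools import combinations

# The four sets of marks available for every one of the sixteen subjects.
mark_sets = [
    "1923 examination, scorer No. 1",
    "1923 examination, scorer No. 2",
    "1924 examination, scorer No. 1",
    "1924 examination, scorer No. 2",
]

# Every unordered pair of mark sets yields one correlation.
pairs = list(combinations(mark_sets, 2))
for first, second in pairs:
    print(f"{first}  vs.  {second}")

n_subjects = 16
total_correlations = n_subjects * len(pairs)
print("correlations in all:", total_correlations)
```

The order in which the pairs are printed differs from the numbering (1) to (6) of the outline, but the same six comparisons appear.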

“The six numbered columns of Table 10 correspond to the 
above numbering scheme. Table 10 presents the ninety-six 
reliability coefficients possible for the sixteen sets of ex- 
aminations. 


TABLE 10* 

Reliability Coefficients of 16 State Diploma Examinations, Year 
(1923) against Year (1924) and Scorer against Scorer 

 No.  Key   Subject             (1)    (2)    (3)    (4)    (5)    (6)    Pop.

  1   G-2   Ele. Citizenship    .45    .21   -.05    .46    .34   -.26    102
  2   I-1   U. S. History       .60    .43    .16    .41    .23    .17     31
  3   J-1   U. S. History       .47    .30    .44    .73    .25    .22     32
  4   F-3   Geography           .58    .39    .37    .22    .17    .55     36
  5   F-1   U. S. History       .89    .99    .67    .64    .67    .69     94
  6   D-2   Civics              .82    .88    .22    .25    .33    .23     36
  7   I-3   Geography           .40    .88    .32    .48    .29    .41     32
  8   M-2   Civics              .80    .82    .47    .55    .46    .52     61
  9   D-1   U. S. History       .81    .22    .54    .65    .49    .35     34
 10   L-1   U. S. History       .79    .57    .73    .48    .45    .66    107
 11   A-1   U. S. History       .81    .85    .66    .71    .56    .65     42
 12   B-1   U. S. History       .53    .58    .36    .34    .27    .41     97
 13   B-2   Civics              .63    .20    .36   -.18   -.06    .25     82
 14   K-1   U. S. History       .93    .91    .56    .67    .68    .51     99
 15   I-2   Civics              .81    .53    .37    .27    .52    .46     35
 16   E-1   U. S. History       .75    .12    .26    .59    .60    .19     32

 Averages                       .69    .56    .40    .45    .39    .38   (952)
 Averages by Pairs of Columns      .62           .43           .38

*Column (1): 1923 examination, scorer No. 1 vs. scorer No. 2. 
Column (2): 1924 examination, scorer No. 1 vs. scorer No. 2. 
Column (3): Scorer No. 1, 1923 examination vs. 1924 examination. 
Column (4): Scorer No. 2, 1923 examination vs. 1924 examination. 
Column (5): 1923 examination scored by No. 1 vs. the 1924 examination scored by 
No. 2. 
Column (6): 1924 examination scored by No. 1 vs. the 1923 examination scored by 
No. 2. 
“Table 11 shows the average scores or marks assigned to 
both 1923 and 1924 examinations of the sixteen state diploma 
examinations. 

TABLE 11 

Average Scores (Marks) Assigned by Two Different Scorers for 
Both the 1923 and 1924 Examinations (the 16 State Diploma 
Examinations) 

                     (1)          (2)          (3)          (4)
 No.   Key No.    1923 Exam.   1923 Exam.   1924 Exam.   1924 Exam.
                   Scorer 1     Scorer 2     Scorer 1     Scorer 2

  1     G-2         67.5         82.0         73.0         70.4
  2     I-1         71.7         67.1         57.0         71.0
  3     J-1         68.1         43.6         64.9         45.6
  4     F-3         70.0         54.8         70.3         69.9
  5     F-1         47.7         45.3         51.4         41.9
  6     D-2         55.7         47.5         65.6         59.1
  7     I-3         51.0         62.3         50.4         68.9
  8     M-2         51.0         48.6         48.5         42.9
  9     D-1         38.3         34.4         48.5         30.7
 10     L-1         49.3         56.5         38.3         65.3
 11     A-1         42.5         28.1         25.3         18.6
 12     B-1         14.4          7.7         24.4         25.3
 13     B-2         29.3         26.0         24.8         11.5
 14     K-1         48.1         59.0         61.3         64.7
 15     I-2         38.3         41.4         68.1         58.0
 16     E-1         21.0         26.9          8.6         12.4


Summary of Differences* 

                        (1-2)   (3-4)   (1-3)   (2-4)   (1-4)   (2-3)

 Average Difference      8.6     9.4     9.2     9.9    11.4    12.0
 Largest Difference     24.5    17.6    29.8    27.0    23.9    26.8
 Smallest Difference     2.4     2.0     0.3     0.4     0.1     0.2


“Table 12 shows the differences in the average scores (of 
Table 11) assigned by two different scorers for both forms 
of the 16 state diploma examinations. Algebraic signs are 
ignored. The outline preceding Table 12 is necessary in 
interpreting the meanings of the columns lettered (a), (b), 
(c), etc., in Table 12.” 

*(1-2), (3-4), etc., refer to the differences in the columns numbered 1, 2, 3, and 4. 
(a) Differences in the average scores assigned to the 1923 and 1924 
examinations by scorer No. 1. 

(b) Differences in the average scores assigned to the 1923 and 1924 
examinations by scorer No. 2. 

(c) Differences in the average scores assigned to the 1923 examinations 
by scorers Nos. 1 and 2. 

(d) Differences in the average scores assigned to the 1924 examination 
by scorers Nos. 1 and 2. 

(e) Differences in the average scores assigned when scorer No. 1 read the 
1923 examination and scorer No. 2 read the 1924 examination. 

(f) Differences in the average scores assigned when scorer No. 1 read the 
1924 examination and scorer No. 2 read the 1923 examination. 
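
The six differences just described can be recomputed from the four averages which Table 11 gives for any one examination. The sketch below does so for Key No. F-3; the function name is chosen here for illustration. For some other rows the recomputed figures differ from the printed ones by a tenth of a point, presumably because the original differences were taken before the averages were rounded; this explanation is a supposition and is not stated in the source.

```python
def six_differences(s1_1923, s2_1923, s1_1924, s2_1924):
    """Absolute differences (a)-(f) between the four average marks.

    Arguments are the averages in columns (1)-(4) of Table 11:
    1923 exam by scorer 1 and by scorer 2, then 1924 exam by scorer 1
    and by scorer 2. Algebraic signs are ignored, as in Table 12.
    """
    return {
        "a": round(abs(s1_1923 - s1_1924), 1),  # scorer 1: 1923 vs. 1924
        "b": round(abs(s2_1923 - s2_1924), 1),  # scorer 2: 1923 vs. 1924
        "c": round(abs(s1_1923 - s2_1923), 1),  # 1923 exam: scorer 1 vs. 2
        "d": round(abs(s1_1924 - s2_1924), 1),  # 1924 exam: scorer 1 vs. 2
        "e": round(abs(s1_1923 - s2_1924), 1),  # scorer 1 on 1923 vs. scorer 2 on 1924
        "f": round(abs(s1_1924 - s2_1923), 1),  # scorer 1 on 1924 vs. scorer 2 on 1923
    }


# Averages for Key No. F-3 (geography), taken from Table 11.
print(six_differences(70.0, 54.8, 70.3, 69.9))
```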


TABLE 12 

Differences in the Average Scores (of Table 11) Assigned by Two 
Different Scorers for Both Forms of the 16 State Diploma 
Examinations. Algebraic Signs are Ignored.* 

 No.   Key No.    (a)†    (b)     (c)     (d)     (e)     (f)

  1     G-2        5.5    11.6    14.6     2.5     2.9     9.1
  2     I-1       14.7     3.9     4.6    14.0     0.7    10.1
  3     J-1        3.2     2.0    24.5    19.3    22.5    21.4
  4     F-3        0.3    15.1    15.2     0.4     0.1    15.5
  5     F-1        3.6     3.3     2.5     9.4     5.8     6.1
  6     D-2        9.9    11.6     8.2     6.5     3.4    18.1
  7     I-3        0.6     6.6    11.3    18.5    17.9    11.9
  8     M-2        2.5     5.7     2.4     5.6     8.1     0.2
  9     D-1        ...     3.7     4.0    17.8     7.6    14.1
 10     L-1        ...     8.8     7.2    27.0    16.0    18.1
 11     A-1        ...     9.4    14.4     ...    23.9     2.3
 12     B-1        ...    17.7     6.7     ...    10.9    16.8
 13     B-2        4.5    14.5     3.3     ...    17.8     1.2
 14     K-1       13.2     5.8    10.8     ...    16.6     2.4
 15     I-2       29.8    16.6     3.0    10.1    19.7    26.8
 16     E-1       12.3    14.5     5.9     3.7     8.6    18.2

 Averages          9.2     9.4     8.6     9.9    11.4    12.0
 Averages by
 pairs of columns      9.3             9.2            11.7

*See Table 10 for subjects involved and numbers of cases. 

†See comments above the table for description of columns lettered (a), (b), (c), etc. 
Gordon’s study of the New York Regents’ Examinations.¹ 
Gordon, working under the direction of the author, carried 
on an investigation similar to that reported by McGregor 
and Ruch except that examinations prepared by the New 
York Regents were employed. Table 13 shows certain of 
the findings. 

It should be noted that the studies of McGregor and Ruch 
and Gordon used examinations constructed for use in one 
state but which were applied to pupils in other states. This 
resulted in lowered average scores, and very possibly in 
somewhat reduced reliability coefficients. This irregularity 
can hardly be held to invalidate the results completely. 
It was further true that the regularly appointed official 
readers were not used in these two investigations.² 

Wood’s studies on old-type test reliabilities. Dr. Ben D. 
Wood has been a most indefatigable investigator of the 
comparative validities and reliabilities of both old- and 
new-type examinations. His studies have taken many 
different directions, and unfortunately he has not found time 
to summarize his findings in any one reference work. A 
series of investigations has attacked in turn the examinations 
of Columbia University, those of the New York Regents, 
those of the College Entrance Examination Board, and 
certain unofficial and individual examinations. On the 
points at issue in this chapter, Wood is perhaps the outstanding 
authority; certainly he has been the most prolific 
and consistent worker. The author is taking the liberty of 
quoting occasional statements from the work of this 
investigator rather than summarizing fully any one study. 

¹W. E. Gordon, A Study of the Reliability of Examinations Based upon the New York 
Regents’ Examinations in the Social Studies, Ph. D. Dissertation (1925), State University 
of Iowa. Published, in part, in Ruch et al., Objective Examination Methods in the Social 
Studies (Chicago: Scott, Foresman and Co., 1926), pp. 23-53. 

²Monroe’s average reliability coefficient was higher than that found by the author and 
his associates, although the reliabilities were approached by somewhat different procedures, 
making exact comparisons impossible. 



TABLE 13 

Reliability Coefficients of Eight New York Regents' Examinations in the Social Studies, 
Year vs. Year, and Reader vs. Reader 
In an early and important study¹ Wood states: “It was to 
get a straight reliability coefficient on the traditional essay 
examination that 117 booklets were re-graded. The correlation 
of the first grading with the re-grading is r = .663.” 
(It is to be noted that Wood obtained .905 for the reliability 
of an objective examination in the same subject.) 

In another place Wood reports a most interesting sidelight 
on the reliability of the marking of papers. He says: 

The facts of a subjective scale are well illustrated in the following anec- 
dote concerning the grading of history papers by a group of college 
professors of history in the summer of 1921. One of the five or six expert 
readers assigned to a certain group of history papers, after scoring a few, 
wrote out for his own convenience what he considered a model paper for 
the given set of ten questions. By some mischance this model fell into the 
hands of another reader who graded it in perfectly bona fide fashion. The 
mark he assigned to it was below passing, and, in accordance with the 
custom, this model was rated by a number of other expert readers in order 
to insure that it was properly marked. The marks assigned to it by these 
readers varied from 40 to 90.² 

Wood, as quoted by Symonds,³ found that in 1921 the 
College Entrance Board Examination in algebra (Mathematics A) 
had a reliability of 0.76. The geometry examination 
(Mathematics C) had a reliability of 0.61. Such 
examinations are three hours in length. 

In his most recent, extensive study of old- and new-type 
examinations, Wood reports these coefficients of reliability 
for ninety-minute essay examinations in modern languages:⁴ 


 Subject       Reliability   No. of Cases     Subject        Reliability   No. of Cases

 French II        ...           1105          Spanish II        0.722         1016
 French III       0.738          867          Spanish III       0.700          629
 French IV        ...             85          Spanish IV        0.695           95


¹From Ben D. Wood, Measurement in Higher Education (Yonkers-on-Hudson, New 
York: World Book Company, 1923), p. 193. 

²Educational Administration and Supervision, Vol. VII (1921), pp. 301-304. 

³P. M. Symonds, Measurement in Secondary Education (New York: The Macmillan 
Company, 1927), p. 297. Quoted by permission of the Macmillan Company. 

⁴B. D. Wood, New York Experiments with New-Type Modern Language Tests (New 
York: The Macmillan Co., 1927), p. 115. 
It should be noted that Wood’s coefficients of reliability 
for new-type examinations in the same subjects showed a 
range of from 0.880 to 0.907, there being no over-lapping at 
all in the values of the reliability coefficients for the two 
types of examinations. 

In a study of examinations in law¹ Wood presents further 
evidence on the relative reliabilities of old- and new-type 
tests in a college subject. After showing that traditional 
and objective examinations over the same course in law 
showed an average correlation of 0.55, Wood shows that 
much of this lack of correlation is explained by the unre- 
liability of the examinations. He next attempted to de- 
termine whether the unreliability was chiefly to be attributed 
to either set of examinations or whether they were equally 
unreliable. The old-type law examinations yielded reliabili- 
ties of 0.59 to 0.73 with an average of 0.66. The reliabilities 
of the new-type tests ranged from 0.72 to 0.92 with an 
average of 0.81. 
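
How far unreliability alone can depress a correlation may be illustrated by the familiar correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. Whether Wood employed exactly this formula is not stated in the passage summarized here; the sketch below is offered only as one standard way of making the point, using the three averages quoted above.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between two measures freed of chance error.

    r_xy is the observed correlation; r_xx and r_yy are the reliability
    coefficients of the two measures (both must be positive).
    """
    if r_xx <= 0 or r_yy <= 0:
        raise ValueError("reliabilities must be positive")
    return r_xy / sqrt(r_xx * r_yy)


# Averages quoted in the text: 0.55 between old- and new-type law
# examinations, with reliabilities of 0.66 and 0.81 respectively.
estimate = correct_for_attenuation(0.55, 0.66, 0.81)
print(f"corrected correlation: {estimate:.2f}")
```

On this reckoning the two kinds of examination agree considerably better than the raw coefficient of 0.55 suggests, which is the sense of the statement that much of the lack of correlation is explained by unreliability.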

Wood’s next step in the argument was to obtain measures 
of the validity of each type of examination by correlating 
each with law-school marks. To do this he compared the 
marks received by 215 students in pairs of law courses, with 
the following results (p. 7): 


                           Correlation of         Correlation of
 Pairs of Law Courses      Essay Examination      New-Type Test
                           Grades With Marks      Grades With Marks
                           in Course              in Course

 1 and 2                        0.38                   0.80
 1 and 3                        0.41                   0.74
 1 and 4                        0.47                   0.72
 4 and 2                        0.47                   0.78
 4 and 3                        0.56                   0.74

 Average Correlation            0.46                   0.76


¹Columbia Law Review, Vol. XXV (1925), pp. 1-16. 
REDUCTION OF SUBJECTIVITY THROUGH SCORING RULES 

Reducing variations in teachers’ marks by means of 
scoring rules. In Chapter I the question was raised as to 
the feasibility of attempting to make the traditional ex- 
amination more objective without changing the fundamental 
nature of such examinations. It is probably possible to 
reduce the subjectivity of essay examinations to a con- 
siderable degree by the formulation of and strict adherence 
to scoring rules or schedules. In a subject like arithmetic or 
algebra there is always the question of how to apportion the 
total credit between choice of the correct solution and pure 
accuracy of computation. In almost any sort of examination 
there will arise cases where the pupil’s thinking is correct, 
but his expression of ideas is faulty. Likewise we must face 
decisions as to the proper amount of penalty, if any, for 
carelessness, poor handwriting, errors in grammar, misspelled 
words, etc. 

It must be true that many of these issues could be settled 
by rule. The rule would not necessarily be absolutely de- 
fensible; in fact it need not be. It may be sufficient that 
there be uniformity in handling the examinations of different 
pupils. 

Two studies will be reviewed in some detail. These are 
probably indicative of the refinements possible through the 
use of scoring rules or schedules. Needless to say, the use 
of such guides will greatly increase the labor of marking 
papers, but at the same time they should increase the con- 
fidence of the teacher in her gradings. 

Kelly’s experiment. F. J. Kelly¹ had six fifth-grade 
teachers give the same examination in arithmetic to their 
pupils. Each teacher marked her papers but did not record 

¹F. J. Kelly, “Teachers’ Marks,” Teachers College Contributions to Education, No. 66 
(New York: Columbia University, 1914), p. 83. 
the marks on the papers. The superintendent had an able 
teacher prepare a set of rules for scoring. The teachers then 
regraded the papers, using the scoring rules. 

Table 14 shows the two sets of gradings, the six teachers 
being represented by the letters A, B, C, etc. 

The teacher who prepared the scoring rules also graded all 
the papers, using her rules. The marks of this teacher, who 
may be called the “judge,” were used as the basis of calculat- 
ing the entries in Table 14. The table is read as follows: 
A difference of twenty-one or more was found between the 
judge and two pupils marked by Teacher E when no scoring 
rules were used. Differences of from sixteen to twenty were 
found in a total of three cases, viz., by Teachers A, E, and F. 
The direction of the differences is also shown. 

The improvement from the use of scoring rules is very 
marked. If we take the position that disagreements of no 
more than five points are not very serious, almost ninety- 
five per cent of the 219 pupils were marked with reasonable 
accuracy when rules were employed, while in the absence of 
rules but sixty-two per cent showed differences of five points 
or less. Even more striking is the fact that there were but 
5.5 per cent of zero differences without scoring standards in 
contrast with sixty-three per cent when rules for scoring 
were used. 

The experiment of Fauber and Ruch. O. W. Fauber,¹ 
under the direction of the author, repeated and extended 
Kelly’s experiment. 

Fauber’s procedure was that of asking forty teachers to 
grade the same arithmetic paper without any specific rules. 
The same paper was then graded by forty-eight teachers 
using a carefully prepared scoring schedule. Table 15 shows 
the results. 

¹Unpublished M. A. Thesis (1926), University of Iowa. 



TABLE 14

Distributions of Differences Between Two Sets of Teachers’
Marks on Fifth-Grade Arithmetic Papers: First, Without Any
Effort to Unify the Methods Used, and Second, by a Common
Standard (Modified from Kelly)

[Table body not preserved; only the row labels “Totals” and “Medians” survive.]


TABLE 15

Fauber’s Results on the Scoring of an Eighth-Grade Arithmetic
Paper With and Without Detailed Scoring Rules

Marks Given                   Without Rules     With Rules

80-84                                     1
75-79                                     1              1
70-74                                     3             11
65-69                                     7             12
60-64                                     9             16
55-59                                     6              7
50-54                                     3              0
45-49                                     3              0
40-44                                     2              1
35-39                                     1
30-34                                     1
25-29                                     2
20-24                                     1

No. of Teachers                          40             48
Median                                 60.1           64.5
Upper Quartile                         65.9           69.5
Lower Quartile                         49.5           60.7
Range of Middle 50%                    16.4            8.8
Semi-interquartile Range                8.2            4.4
Total Range                           22-84          42-77

The variability was reduced almost one half when rules 
were used; the semi-interquartile ranges being 8.2 and 4.4, 
and the total ranges being 60 and 35. 
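
The semi-interquartile ranges quoted above can be checked from the quartiles printed in Table 15. The following is a minimal sketch only; the function name is an illustrative choice and not part of the original study.

```python
def semi_interquartile_range(lower_quartile, upper_quartile):
    """Half the spread of the middle 50 per cent of the marks."""
    return (upper_quartile - lower_quartile) / 2


# Quartiles as printed in Table 15.
without_rules = semi_interquartile_range(49.5, 65.9)
with_rules = semi_interquartile_range(60.7, 69.5)

# Share of the original variability that remains once rules are used.
remaining_share = with_rules / without_rules

print(round(without_rules, 1), round(with_rules, 1), round(remaining_share, 2))
```

On these figures a little over half of the original spread remains, which agrees with the statement that the variability was reduced almost one half.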

Fauber’s results are in fair harmony with those of Kelly,
although he did not succeed in eliminating subjectivity to a
satisfactory degree when we consider that, even with de-
tailed guides to scoring, forty-eight teachers marked the
same paper all the way from 42 to 77.

In a second study Fauber took a paper in eighth-grade his-
tory which was high in content value. This paper is called
Paper 1A in Table 16. He next changed this paper, re-writ-
ing it in a careless fashion, making errors of various sorts in
grammar, punctuation, spelling, etc. Parts were hastily
scratched out, and a few ink blots were made. The re-written
paper is termed Paper 1B. No changes were made in content
or thought.

TABLE 16

Fauber’s Results on the Influence of Form on the Marking of a
Paper in U. S. History

Teacher No.                1    2    3    4    5    6    7    8    9   10   11   12

Paper 1A                  96   92   93   91   91  100   82   92   91  100   76   94
Paper 1B (Content)        95   92   93   91   92  100   83   92   91  100   76   94

Deductions on Paper 1B:
  Neatness                10    2    5    5                              4    4    1
  Form                          2         5                              4    5    2
  Spelling                      2        10                              4    5    2
  Grammar                       2         5                             14    3    1
Total Deductions          10    8    5   25   10    0   10    0    0   26   16    1

Final Grade on Paper 1B   85   84   88   66   82  100   73   92   91   74   60   88


The range of marks assigned by the twelve teachers to
Paper 1A was from 76 to 100, or twenty-four points. The
range for Paper 1B was from 60 to 100, or forty points (the
range for content alone being from 76 to 100, or twenty-four
points, as was the case for Paper 1A).

It seems clear that the range of marks is partly due to the
variations in teachers’ practices relative to deductions for
faulty language, neatness, etc. This increase is about sixty-
six per cent according to Fauber’s findings if we take the
variation in content of Paper 1B as twenty-four points and
the variation in final marks as forty points, the additional
sixteen points being the variability due to language factors.
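
The arithmetic behind the figure of about sixty-six per cent may be retraced from the marks in Table 16. This is a sketch for illustration; the variable and function names are assumptions introduced here, and the marks are copied from the table as printed.

```python
# Marks for Paper 1B from Table 16, Teachers 1 to 12.
content_marks = [95, 92, 93, 91, 92, 100, 83, 92, 91, 100, 76, 94]
final_marks = [85, 84, 88, 66, 82, 100, 73, 92, 91, 74, 60, 88]


def mark_range(marks):
    """Distance in points between the highest and lowest mark."""
    return max(marks) - min(marks)


content_range = mark_range(content_marks)   # spread due to content alone
final_range = mark_range(final_marks)       # spread after deductions for form
extra_points = final_range - content_range  # spread added by the deductions
relative_increase = extra_points / content_range

print(content_range, final_range, extra_points, f"{relative_increase:.1%}")
```

The added sixteen points amount to two-thirds of the twenty-four-point range for content, which is the increase described in the text.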

The conclusions to be drawn from the work of Kelly and
Fauber are:

1. The variability due to subjectivity of scoring in the
traditional examination may be cut at least in half by the
use of carefully laid-down scoring rules.

2. In spite of this fact, the traditional examination re-
mains highly subjective.

3. Much of the variability in teachers’ markings of ex-
amination papers arises from varying practices in penalizing
pupils for grammatical, dictional, punctuation, spelling, and
careless errors.

4. Although it can hardly be held that essay tests may be 
refined sufficiently by the use of scoring rules, such examina- 
tions must be employed at times, and the use of scoring rules 
will eliminate some of the unreliability present. 

SUMMARY, DISCUSSION, AND CONCLUSIONS

A number of investigations summarized. Table 17 was 
assembled from a number of sources; principally the writings 
of Monroe, Wood, and the author and his students. This 
table must not be taken without reservations, since such an 
assemblage of correlations necessarily includes a wide 
variety of different situations. To be specific, widely 
different conditions existed with respect to such matters as: 
the numbers entering into the correlations, the school sub- 
ject, the level of maturity of the pupils (elementary, secon- 
dary, and college), the lengths of examinations, differences 
in range of ability (heterogeneity), etc. 

Table 17 shows a median reliability coefficient of 0.59 
which is noticeably lower than the median value reported 
by Monroe. (See Table 9.) In view of all the facts, we must 
conclude that the central tendency of the old-type examina- 
tion is toward a reliability not far from 0.60 to 0.65; it is 
certainly less than 0.70 on the average. 

The investigations reviewed in this chapter are in com- 
plete harmony on all major issues. All point toward the 
subjectivity of teachers’ marks and the older forms of 
examinations. If space permitted it would be interesting
to comment on these studies in greater detail, but, in the
main, the tables and summaries tell their stories without
need for extended comment.

TABLE 17

Summary of Reliability Coefficients for Traditional (Essay-Type)
Examinations, as Reported by Various Investigators

Reliability Coefficient       No. of Times Reported

 .93 to  .97                            9
 .88 to  .92                           12
 .83 to  .87                           15
 .78 to  .82                           24
 .73 to  .77                           24
 .68 to  .72                           19
 .63 to  .67                           24
 .58 to  .62                           22
 .53 to  .57                           18
 .48 to  .52                           13
 .43 to  .47                           21
 .38 to  .42                           17
 .33 to  .37                           11
 .28 to  .32                           11
 .23 to  .27                           14
 .18 to  .22                            8
 .13 to  .17                            8
 .08 to  .12                            3
 .03 to  .07                            4
-.02 to  .02                            2
-.03 to -.07                            2
-.08 to -.12                            0
-.13 to -.17                            1
-.18 to -.22                            2
-.23 to -.27                            1

Median = .59                    Total 285
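
The median of .59 can be verified from the frequencies of Table 17 by the usual interpolation within the class interval that contains the middle case. The sketch below assumes that the intervals run in equal steps of .05 (so that the second interval, partly illegible in the printed table, begins at .88); the function name and the interpolation routine are illustrative and are not given in the text.

```python
# Frequencies from Table 17, highest interval (.93 to .97) first.
counts = [9, 12, 15, 24, 24, 19, 24, 22, 18, 13, 21, 17, 11,
          11, 14, 8, 8, 3, 4, 2, 2, 0, 1, 2, 1]

# Lower stated limit of each interval in hundredths: 93, 88, ..., -27.
# Assumes equal steps of .05 throughout the table.
lower_limits = [93 - 5 * i for i in range(len(counts))]


def grouped_median(lower_limits, counts, width=5):
    """Median by linear interpolation inside the interval holding the middle case."""
    half = sum(counts) / 2
    below = 0
    for lower, count in sorted(zip(lower_limits, counts)):  # lowest interval first
        if count and below + count >= half:
            real_lower = lower - 0.5  # real lower limit, e.g. .575 for ".58 to .62"
            return (real_lower + (half - below) / count * width) / 100
        below += count
    raise ValueError("the distribution contains no cases")


total = sum(counts)
median = grouped_median(lower_limits, counts)
print(total, round(median, 2))
```

The middle case falls in the interval .58 to .62, and the interpolated value rounds to the median printed beneath the table.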


We must return to the list of objections with which this
chapter opened. The first two have previously been dis-
cussed at length. It is only necessary to state once more that
the traditional examination is open to two serious limitations:

1. Subjectivity of marking, and

2. Limited sampling.

Limited sampling is unavoidable in any examination,
new-type or old-type. A comparison of the older examina-
tion with the newer on the score of sampling would seem at 
first sight to present little ground for choice. This is not 
quite true for two reasons: 

1. As was shown in Chapter II, the two sorts of examina- 
tions differ in their theories of sampling. The traditional 
examination employs an intensive type of sampling; the 
new-type, an extensive sampling. There is considerable 
logical (and some experimental) evidence of the superiority 
of the extensive sample. (See Chapter II.) 

2. The new-type examination can cover far more ground
in the same amount of working time because there is no need
to spend time in writing a mass of words. The response by
underlining, encircling, checking, etc., is so rapid that at
least ninety per cent of the examination period is spent in 
thinking about the responses. With the traditional ex- 
amination a larger fraction of the time is spent in writing. 

Excessive writing of answers is uneconomical. Prominent 
among the objections to the traditional examination is the 
charge that it is wasteful of the pupil's time. In a sixty- 
minute examination a pupil spends from fifteen to thirty 
(sometimes more) minutes in writing his answers. If no 
writing were necessary, the examination might include at 
least twice as many questions. Worse still, most of the words 
which he writes convey little information about his real 
knowledge of the subject. Language requires a large 
number of “filler words,” useful to be sure for grammatical 
reasons, but useless for examination purposes if we view the 
examination as a measuring instrument. 

Traditional examinations encourage bluffing. This charge 
must be admitted, although something analogous, and per-
haps fully as objectionable, is inherent in many of the new-
type tests. The reference is to the opportunity for guessing
correct answers in such tests as the true-false and multiple- 
choice. 

The phenomenon of bluffing needs no comment. It exists 
in all examination situations. The nature of the essay 
examination makes it open to such abuses. Pupils know 
with a somewhat uncanny intuition that teachers are loath
to mark any question zero unless the question is left absolutely
blank. There is an ever-present, and perfectly human and
philanthropic, tendency to reward effort, no matter how
misguided and futile. Laudable as are such “weaknesses
of the flesh,” they do not serve the ends of measurement. It
will be recalled that four teachers out of eighty-nine rated as
better than zero the statement, “The five largest cities of the
United States are (1) navada (2) Arkansas” in reply to
the question, “Name and locate five of the largest cities of the
United States, and name their leading industries, exports,
and imports.”

There is a genuine difference between old- and new-type 
examinations in one respect. When a pupil is confronted 
with a broad discussion question, he in one sense chooses 
the line of attack. He may be entirely ignorant of the 
import of the question, but for the time being he is the general 
in charge. He can naively “misunderstand” the question
and write on some alien topic where his meager store of
knowledge can be turned to better advantage. He can
at times go around, under, or over the topic in a very skillful
manner. He has nothing to lose, and he might win in the
hands of a philanthropic teacher. An objective test, on the
contrary, forces him to “face the music.” In this case the
teacher chooses the battleground. The examination forces
the pupil to react to those things which the teacher deems
important. It gives her, as it should by right of more mature 
wisdom and judgment, the leadership in the examination 
period.

On the whole, exchanging the disturbing factor of bluffing
for the admitted danger of guessing (in many new-type tests) 
is a gain, since there is no mathematical formula for minimiz- 
ing bluffing but there is a more or less adequate statistical 
means of allowing for guessing. 
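
The text does not state at this point which statistical allowance is meant. One conventional form, offered here only as an assumption, is the “rights minus wrongs” type of correction, in which the number of wrong answers divided by one less than the number of choices is subtracted from the number right. The function below is a hypothetical illustration and not the author’s own formula.

```python
def corrected_score(rights, wrongs, choices):
    """Conventional correction for guessing: R - W / (k - 1).

    `choices` is the number of alternatives per item (2 for true-false).
    Omitted items are neither rewarded nor penalized.
    """
    if choices < 2:
        raise ValueError("an item needs at least two alternatives")
    return rights - wrongs / (choices - 1)


# True-false test of 100 items: 80 right, 20 wrong.
true_false = corrected_score(80, 20, choices=2)

# Four-choice test of 100 items: 70 right, 30 wrong.
multiple_choice = corrected_score(70, 30, choices=4)

print(true_false, multiple_choice)
```

No comparable adjustment can be written down for bluffing, which is the contrast the paragraph above draws.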

The answer to a question of the discussion type is a com-
plex thing. It is made up of many sorts of elements. To
mention a few:

1. There are some statements which are true, and to the 
point. 

2. There are other statements which are true, and beside
the point.

3. There are statements which are absolutely false. 

4. There are many things which the teacher hoped would 
appear in the answer, but which are missing. 

5. There are half-truths, garbled statements, and ambigu- 
ous statements which reduce in extreme cases to meaningless 
sequences of words. 

How can such diverse elements be fused into a single
judgment? The answer seems to be that they cannot; at 
least the weight of the evidence of this chapter is that they 
cannot be evaluated with any high degree of accuracy. 

Final conclusions. The element of subjectivity in the
traditional examination is a source of marked unreliability.
Such examinations have been found to be wasteful of time
in the sense that excessive writing of words which convey no
knowledge of accomplishment to the marker of the paper is
required. Such wasted time would allow the answering of a
much larger number of short-answer or objective questions.
An objective test over the same ground covered by a tradi-
tional examination would yield a far more extensive sampling
of the pupil’s knowledge. Subjectivity of marking may be
reduced about one-half by the adoption of and adherence
to a set of scoring rules when essay examinations are to be
graded. Such scoring rules increase the labor of scoring
papers, but are nevertheless highly desirable. The tradi-
tional examination should be employed principally when the
subject-matter does not lend itself to completely objective
measurement; even in such cases the results must be taken
with a great deal of conservatism. A combination of tradi-
tional and new-type examinations should probably be used
in many school subjects, especially where present knowledge
is unable to suggest purely objective types of measurement.



CHAPTER IV 


ADVANTAGES AND LIMITATIONS OF 
OBJECTIVE EXAMINATIONS 

General statement. The general course of argument pre- 
sented in this chapter may be made clear by this outline: 

I. Advantages of objective examinations 

1. Objectivity (freedom from personal opinion) in
scoring

2. Extensive sampling 

3. High reliability per unit of working time 

4. Economy of scoring 

5. Freedom from bluffing 

6. Greater control of the examination system by the 
teacher 

II. Limitations of objective examinations 

1. No provision for language training 

2. Open to guessing and chance 

3. Reputed to measure only factual memory 

4. Said to be an unnatural method of using school-
acquired information

5. Test recognition rather than spontaneous recall

Many of the above points have been commented on at
some length in preceding chapters. The present chapter is
therefore somewhat of a summary and co-ordination of
points of view.

ADVANTAGES OF THE OBJECTIVE EXAMINATION 

Objectivity of scoring. Objectivity of scoring needs little
further comment. Chapter III has shown the dangers of
subjective scoring. When 100 (or more) teachers grade the
same paper all the way from 28 to 92 (as Starch found for a
geometry paper), or when they evaluate the same question
so differently as from 3 to 20 (as Ruch found for a geography 
paper), the only conclusion which can be drawn is that such 
examinations do not measure the pupils. 

Objectivity is a prime essential for reliability of measure- 
ment. Much of the educational process must be highly sub- 
jective, but measurement implies accuracy. The examina- 
tion should exclude so far as is possible the personal opinions, 
biases, whims, and temperaments of teachers. The ex- 
amination should be a measure of the pupil, unadulterated 
by factors which represent the psychological reactions of 
the teacher. 

The objective or new-type examination can be made 
wholly or almost wholly objective. The traditional ex- 
amination cannot. It follows that a high degree of ob- 
jectivity can only be had through the new-type test. To the 
extent to which this is true, the objective test has a clean-cut
advantage over the older forms of examinations. 

Of the two principal sources of unreliability in examina-
tions (subjectivity and limited sampling), the former only is
completely eliminable. It follows, therefore, that this source
of inaccuracies in examination marks should be minimized
or eliminated completely.

Thomson mentions the question, “Describe the universe
and give two examples” as the extreme form of such sub-
jective questions.¹ This question may not be real, but it
has as its “twin” such a question as “What were the con-
tributions of Babylonia to civilization?” or, “Describe the
geography of Greece.”² It is only fair to contrast such ques-
tions with the misguided attempt at objectivity reported by
Brinkley,³ who found the following: “The ______
who ______ was ______.” The ex-
pected answers (it may be interesting to learn) were “man,”
“discovered America,” and “Columbus.”

¹G. H. Thomson, Instinct, Intelligence, and Character (New York: Longmans, Green
and Co., 1925), p. 202.

²See W. J. Osburn, Are We Making Good at Teaching History? (Bloomington, Illinois:
Public School Publishing Co., 1926), for tabulations of the kinds of questions which history
teachers actually ask.

Extensive sampling. There can be no real dispute on 
this point. The ordinary five- or ten-question discussion
type of examination requires a great deal of writing. It
follows that a large fraction of the words actually employed
in phrasing an answer to such a question are “filler” words,
i. e., words needed to complete the sentence structure but 
valueless in adjudging the merit of the pupil’s answer. The 
word counts of Thorndike and Horn have shown that the 
most frequently recurring words in the written language are: 
a, about, all, also, am, an, and, any, are, as, at, be, been, by, 
can, do, for, get, has, have, he, I, if, in, is, etc. None of these 
words, ordinarily, may be expected to convey any knowledge 
about the child’s achievement. 

Rough studies by the author of the answers to questions
given by pupils on hundreds of uniform state eighth-grade
examinations show that from four to ten lines of written
answers are given to each question. When these answers
were analyzed, the central tendency seemed to be that from
four to seven different ideas or facts were reported. This
means that the effective length of a ten-question examination
was from forty to seventy items. Such examinations re-
quired (or were allowed), usually, sixty minutes. The
sampling per minute of working time could hardly have been
more than one idea (or fact) per minute. This is very small
indeed. Experiments reported in Chapter XI show that
sixty minutes of objective testing would have permitted at
least three times as extensive an examination.
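
The sampling rate described above follows from simple arithmetic, sketched here for illustration. The variable names are assumptions; the figures are those given in the paragraph, and the factor of three is the minimum the author claims for objective testing.

```python
questions = 10            # questions in the essay examination
ideas_per_answer = (4, 7) # low and high central tendency per answer
minutes = 60              # working time allowed

# Effective length of the essay examination, in ideas or facts.
effective_items = tuple(questions * n for n in ideas_per_answer)

# Ideas or facts sampled per minute of working time.
rate_per_minute = tuple(items / minutes for items in effective_items)

# "At least three times as extensive" in the same sixty minutes.
objective_items = tuple(3 * items for items in effective_items)

print(effective_items, [round(r, 2) for r in rate_per_minute], objective_items)
```

Even at the upper figure the essay examination samples only slightly more than one idea per minute, while the objective test of equal working time would reach well over one hundred items.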

The fact is that the usual discussion or essay examination
is very wasteful of the pupil’s time. Half to three-fourths
of the examination period is spent in writing words, useful
enough for purposes of sentence structure but quite valueless
in conveying to the teacher any facts about the pupil’s
knowledge of the subject. By checking, underlining, or
inserting occasional words, from three-fourths to nine-
tenths of the writing may be eliminated in an examination.
This results in an obvious economy of effort, an increased
allowance of time for thought, and an increased extensity of
sampling in a given amount of examination time.

³S. G. Brinkley, Values of the New-Type Examinations in the High School (New York:
Columbia University, Teachers College Contributions to Education, No. 161, 1924), p. 36.

As has been said repeatedly in this volume, reliability 
depends principally on two factors, objectivity and extent 
of sampling. The unreliability arising from subjectivity 
may be eliminated completely. Unreliability due to samp- 
ling cannot. It will always be present since measurement is 
invariably a sampling; never complete. The key to reliable 
educational measurement is therefore extensive sampling in 
order to secure an accurate and stable measure of each pupil. 

In Chapter II it was shown that the underlying theory of
sampling was different in the old and the new types of ex-
aminations. The former represented an intensive coverage
of a very small number of topics; the latter represents a
more extensive sampling of topics but a less thorough cover- 
age of any one topic. When time is limited, as is usually 
the case with examinations, somewhat more reliable results 
are to be expected from what the author has termed the 
extensive sampling. 

High reliability per unit of working time. There is
nothing of importance to add to the discussion of the two
preceding sections. Economy demands that no unnecessary
writing be done during the examination. If discussional
types of tests are to be given, we must face frankly the situa-
tion of giving two to three times as much time to examina-
tions as is required by the newer and more economical ob-
jective tests. The reliability of the ordinary ten-question
essay examination of sixty minutes can hardly be held to
average more than 0.60 to 0.70 in view of the available
evidence. In later chapters of this book (especially Chapter
XI) it will be shown that sixty-minute objective tests may
show reliabilities as high as from 0.70 to 0.90, provided care
is used in formulating such tests. The sixty-minute ob-
jective test may contain, not five or ten questions, but as
many as from 100 to 200 items.

The author once studied a series of written examinations 
as administered by a state department of public instruction 
to all eighth-grade pupils. He selected those subjects which 
are covered by a well-known battery of standard educational 
tests, and determined the reliability of this composite. 
By the usual prediction formula devised by Spearman and 
Brown it was shown that in order to equal the reliability of 
the standard test (of about two and one-quarter hours 
working time), it would be necessary to do three and one-
half weeks of testing by the type of examination actually
employed. This may be an extreme case, but it is a good 
illustration. The name of the state is withheld, although no 
especial reflection upon that state is implied, other than upon 
the continued adherence to such an examination. 
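
The prediction formula of Spearman and Brown referred to above estimates the reliability of a test lengthened n times as nr / (1 + (n - 1)r), where r is the reliability of the original test. The sketch below applies it to illustrative values only: the reliability of 0.60 and the target of 0.90 are taken from the ranges quoted earlier in this section and are not the figures of the state examination study, which the text does not report. The function names are assumptions.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability of a test made `length_factor` times as long."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def required_length_factor(reliability, target):
    """How many times a test must be lengthened to reach `target` reliability."""
    if not (0 < reliability < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target * (1 - reliability) / (reliability * (1 - target))


# Illustrative values: an essay examination of reliability 0.60.
doubled = spearman_brown(0.60, 2)               # reliability if doubled in length
factor = required_length_factor(0.60, 0.90)     # lengthening needed to reach 0.90

print(round(doubled, 2), round(factor, 1))
```

On these assumed figures an essay examination would need to be made about six times as long to match an objective test of reliability 0.90, which indicates how a comparison in working time can grow as large as the one reported above.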

We dare not close this section without reference to one
experimental study which is not entirely in harmony with
our point of view. Brinkley (comparing old- and new-type
examinations) found that “given tests of equal length,
as measured by time spent in testing, and prepared by teach-
ers with some training in the matter of test construction,
one type of test yielded practically as good results as another
for measuring achievement in history in the senior high
school.”¹

¹S. G. Brinkley, Values of New-Type Examinations in the High School (New York:
Columbia University, Teachers College Contributions to Education, No. 161, 1924), p. 58.

This study of Brinkley’s is somewhat disconcerting and
difficult of explanation. It certainly stands as an exception
to the findings of such investigators as Wood, Toops, DeGraff,
Stoddard, and Ruch. Symonds, after summarizing
and reviewing the earlier literature on objective tests (in-
cluding the work of Brinkley), says:

The evidence is decisive that new-type examinations are more reliable
than traditional examinations. In fact, the evidence is more decisive than
an off-hand inspection of the above tables [not reproduced here] would
indicate. One usually wishes to compare examinations for reliability under
equal working times. Monroe does not state the average length in time of
the sixty-six sets of examinations used by him in getting his reliability
coefficients. The median number of questions in an examination was seven,
and one may assume that the examination time was at least an hour . . .
Ruch’s correlation of .896 for 100 recall questions was based upon 18.7
minutes of testing time. Thus it is a matter of comparing an average
reliability of .65 for the traditional examination obtained in sixty minutes,
more or less, of testing time with a reliability of .90 for a new-type test
obtained in nineteen minutes of testing time. The superiority of the new-
type test, so far as reliability goes, is plainly evident.¹

Brinkley himself states that “the differences were not in
any of the cases large.” The new-type tests made by class-
room teachers after some preliminary training were not
quite equal in validity to the old-type tests, but were better 
than those made before this training was given. On the 
other hand, the tests made by Brinkley himself were not 
inferior to the traditional examinations, but, on the whole, 
were somewhat superior. 

The experimental group used by Brinkley was not large.
Starting with 163 pupils, only ninety-five sets of records
were finally available. This tended to obscure the signifi-
cance of any differences noted. It is also to be remembered
that each of the ten tests used contained but thirty-three
items, divided as follows:

True-false              100     Word or Phrase Answer    52
Multiple Choice          89     Arrangement              20
Completion               66     Essay                     9

¹P. M. Symonds, Measurement in Secondary Education (New York: The Macmillan
Co., 1926), p. 298. Quoted by permission of the Macmillan Co.

Moreover, the plan of scoring the essay examinations
was one which is not typical of prevailing practice, and one 
certain to improve the reliability and validity of its results. 
Brinkley says: 

The scoring of the essay examinations was rendered as consistent as
possible by listing beforehand the items that should be included in a
correct answer or the comparison that should be made. Often a set of
papers was read and additional items entered on the key as a result before
the real scoring was begun. Values were then assigned to the different
items so as to give a total score of 10 for each question, and the papers
scored by comparison with this key.¹

As has been said, this plan of scoring is not typical of the
usual practices, and was one certain to improve the reliabili-
ties of the essay examinations. This study by Brinkley,
although carefully executed, should be repeated with larger
numbers, and perhaps under conditions somewhat more
typical of actual practices, before it should be held to refute
a number of more extensive and equally careful experiments
which have yielded quite different results.

Economy of scoring. There is no dispute on this point. 
Objective tests if planned carefully as to mechanical features 
may be scored from two to five or more times as rapidly as 
can essay tests of comparable length. The term “compar- 
able” has been used advisedly since there are a number of 
bases for such comparisons. We might compare old- and 
new-type tests of equal lengths in terms of (1) numbers of 
items, or questions, (2) equal working times, or (3) equal 
reliabilities. 

The first basis is, of course, quite indefensible since an
objective test item is quite a different unit from an essay
question. To compare tests of the two types having equal
working times is a better basis, but, as has been shown, the
objective test is in effect a longer test under such a condition
since it is more economical of time. On the whole, it is more
accurate to think of the relative economies in terms of tests
which are equally reliable. This will mean, in general, that
a ten-question essay examination is to be compared with
objective tests of perhaps forty to seventy-five items. The
latter can be scored at rates of twenty to thirty or more per
hour, depending upon the type of test and the mechanical
arrangement of the responses, etc. Essay examinations of
ten questions cannot be scored, as a rule, faster than from
ten to fifteen an hour.

¹See Chapter III for evidence that subjectivity might well have been reduced by one-
half by Brinkley’s use of carefully drafted rules for scoring. Brinkley does not seem to
give adequate weight to this fact when he concludes that old- and new-type examinations
are roughly equal in merit. A fairer comparison would have been to measure objective
tests against essay tests read by the prevailing methods actually in use by teachers, i. e.,
without detailed scoring rules.

Freedom from bluffing. This advantage has already been 
pointed out (Chapter III). It has its analogue in the so- 
called guessing element in many objective tests. A detailed 
consideration of guessing is reserved for Chapter XII. 
Bluffing is even less desirable than guessing, since partial 
control of the latter is possible through mathematical 
formulas. 

Greater control of the examination by the teacher. So
far as the author is aware, this superiority of the objective 
examination has not been commented upon by other 
writers. 

It is well known that a pupil who is shaky on the answer 
to an essay question can often “get by” by pretending to 
misunderstand the intent of the question, and then writing 
on something more to his liking and information. To some 
extent, therefore, the pupil chooses his reactions to such a 
question. The question may encourage this in two ways: 
(a) by ambiguity of statement, or (b) by intention, i. e., 
some teachers prefer to give the pupil great latitude in choos- 
ing the direction of his replies. Desirable as the latter may 
be in theory, it makes for greater subjectivity in evaluating 
responses.

With the new-type examination the teacher forces the pupil
to react, one way or another, to what she thinks is important. 
(The items of the test are supposedly her concept of what is 
of prime importance.) There can be no doubt that this 
practice extends the teacher’s control over the examination 
situation. Moreover, the teacher has exactly the same right 
to control the nature of the pupil’s reactions during an ex- 
amination as she has to select subject-matter, methods, etc., 
in the larger phases of instruction. Objective examinations 
cannot “misfire” to the same extent as essay tests which so 
often prove unsatisfactory because the pupils did not suc-
ceed in writing on the issues intended by the teacher.

LIMITATIONS OF THE OBJECTIVE EXAMINATION 

Neglect of language training. This admitted limitation
of the new-type test was discussed in Chapter III, where
the claims of the traditional examination were also con-
sidered and criticized. Suggestions were made as to
methods of attaining increased language facility by written 
examinations. 

The “guessing” element in objective tests. This matter 
has also received passing attention in preceding chapters. 
Chapter XII of Part III will take up in detail the history of 
the controversy on the degree of invalidation of true-false
and multiple-response tests resulting from the large amounts
of guessing possible.¹

Objective examinations measure memory only. This is a 
charge which is easily made by opponents of the new-type 
examination. Moreover, this charge too often makes the 
tacit assumption that the traditional examination is free
(or at least, freer) from this criticism.

¹The author cannot forbear quoting a somewhat incoherent communication recently
received from a teacher. Speaking of true-false tests, she said, in part (italics mine):
“The element of guesswork is strong and the gamboling instinct is brought to bear. If they
(the pupils?) are penalized, i.e., ‘miss one, charge two’ is worse still, it is dishonest.”

If there is any one place in examination practices where
loose thinking prevails, it is relative to the so-called
“thought” question. Teachers and educators pay lip service
to the thought question and then proceed merrily to ask
pupils to “Name the principal products of New England” 
or to “List the main causes of the Revolutionary War.” 
The head master of a famous military school once wrote the 
author to the effect that he had taken a six-weeks' course 
in one of the largest universities for the training of teachers. 
For six weeks the professor waxed warm and loud in his 
praises of the thought question — and never once gave a 
concrete illustration! 

It is not the intention either to deny the existence of the 
thought question or to decry its merits. It is merely sug- 
gested that we clarify our ideas on the point before attempt- 
ing criticism of any type of examination. Perhaps it be- 
hooves the author to come forth with good examples of 
thought questions at this point. This challenge will not be 
met fully for many reasons, not to mention possible inability. 

Suppose we take as a fair approximation to a thought 
question, “Why do many taxation authorities believe that 
the income tax is the fairest form of taxation yet developed?” 
The average man, whose experience with taxation consists 
in little more than semi-annual grumbling when taxes fall 
due, might, if intelligent, reason out certain valid arguments 
in favor of this proposition. To the extent that he did this, 
the question is a thought question. On the other hand, a 
high-school senior or junior, having just completed a course 
on economics, might answer in one-two-three order the 
points concisely summarized and presented on page 289 of 
his textbook. The same question is a matter of memory 
in this case. 

The point is: The difference between a thought and a memory 
question does not reside principally in the question itself but in 
the mental background of the pupil. Paul Smith, an intelli- 
gent but indolent boy, may sit next to George Brown, of 
moderate ability but good memory, and receive the same 
instruction and write the same examinations. Paul may do 
a “lot” of thinking, much of it original, when examination 
day comes. George may merely place on paper his carefully 
hoarded information. George’s paper may be superior by 
far, but Paul did whatever thinking was called forth by the 
thought questions asked. 

In a sense the difference between thought and memory 
questions is chronological. The question in mathematics 
which is an “original” for the eighth-grade pupil reduces to 
sheer memory by the end of his high-school course. 

As an approximation to a good thought question, the 
following is submitted: 

State what you might have prophesied as to the future of the Roman 
republic, if you had lived in the first century before Christ and had known 
the following facts: Marius becomes consul for the seventh time; Sulla is 
given the title of “Perpetual Dictator”; Caesar becomes dictator for life. 
State additional facts that support your conclusion.¹ 

The foregoing is not an easy task for the average high- 
school graduate. Compare it with the following attempt to 
objectify the same question. 

Directions: Check (X) the statement which expresses what you might 
have prophesied as to the future of the Roman republic if you had lived 
during the first century before Christ and had known the following facts: 

Marius becomes consul for the seventh time. 

Sulla is given the title of “Perpetual Dictator.” 

Caesar becomes dictator for life. 


1. The republic was on the verge of developing a greater democracy. 

2. The army becomes less aristocratic, and Marius enlists all men who desire to fight. 

3. The senate desired to grant greater economic rights to the working class of Rome. 

4. Civil wars and the military rule of one man power would in time overthrow the republic. 

5. The rule of the assembly and its leaders was about to triumph over the rule of the senate. 

¹ Taken from an examination of the New York State Regents in History, Major Se- 
quence, Course A, 229th High School Examination, June 20, 1923. 


Directions: Check two (2) additional facts which support your con- 
clusion as indicated above. 


1. Rome became a great manufacturing city, thus changing the political organization. 

2. The Italians wanted citizenship because the municipal government could not rule successfully the entire peninsula. 

3. The wealth brought in as booty from foreign wars greatly lessened the taxes which the poor were forced to pay. 

4. Rome would not pass laws for the relief of the poor, causing many to die of starvation. 

5. The women of Rome enjoyed greater freedom than the women of Greece. 

6. Rome conquered practically all of the known civilized world. 

7. Greek culture was acknowledged to be superior to Roman. 

8. Farmers of Italy, unable to earn a living, came to Rome in great numbers in search of work; the resulting unemployment created an economic and political problem leading to the passage of the Free Corn Laws. 

9. Rome lost interest in education and culture. 

10. The Roman Empire required all people to worship the emperor. 


The objective version of this thought question is not sub- 
mitted as a “model” question. The framer of this question 
found his task a difficult one because he was not at all certain 
what the original question called for. The objective form 
of the question does have suggestive value as to the technique 
of objective thought questions. It is an open question 
which form of the question is more valid and reliable; our 
experimental evidence favored the objective question in 
spite of whatever faults it may have. There is small chance 
of earning a very high score by guessing, and the scoring is 
about as simple as can be imagined. 
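
The remark about guessing can be checked with a little arithmetic. The sketch below is only an illustration under assumptions the text does not state: that the pupil guesses blindly with every response equally likely, that exactly one of the five statements is keyed as correct, and that exactly two of the ten supporting facts are keyed as correct. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def chance_of_perfect_guess(n_statements=5, n_facts=10, n_to_check=2):
    """Probability that blind guessing earns full credit on the two-part item.

    Assumptions (not stated in the text): one statement is keyed correct,
    exactly n_to_check facts are keyed correct, and every possible
    response is equally likely.
    """
    p_statement = Fraction(1, n_statements)            # choose 1 statement of 5
    p_facts = Fraction(1, comb(n_facts, n_to_check))   # choose the right pair of 10
    return p_statement * p_facts


p = chance_of_perfect_guess()
print(p)  # 1/225
```

Under these assumptions a pupil who knows nothing has one chance in 225 of a perfect score on the item, which is in keeping with the statement that a very high score is unlikely to be earned by guessing.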

It should be noted that any desired degree of discrimina- 
tion of thought can be had by careful framing of the state- 
ments from which the pupil must choose. In such a ques- 
tion some statements should have high merit, some be true 
but hardly pertinent, others should be quite beside the point, 
and some may well be entirely false. 

While we are considering this particular question, we may 
as well scrutinize the other questions from the same examina- 
tion. The reader must decide which are the thought ques- 
tions, if any. 

Write on architecture and building among the early Egyptians, touching 
on (a) kinds of structures built, (b) general plan and appearance, and (c) 
materials used. 

What geographic features of Greece favored (a) the growth of small 
states, and (b) commerce. Explain fully in each case. 

Mention two valuable contributions to civilization resulting from the 
barbarian invasion of the Roman Empire. 

Describe an attack on a medieval castle, pointing out the difficulties 
to be overcome and the means used to accomplish the downfall of the 
defenders. 

Write on the Magna Charta, touching on (a) the circumstances under 
which it was granted, (b) three provisions, and (c) its importance in history. 

In what three parts of the world were England and France rivals during 
the 18th century? Which country gained territory as a result of this 
struggle? Where was this territory located? 

Write in detail of the services that Peter the Great rendered to Russia. 

Wood, apropos of modern foreign language tests, has made 
some very pointed remarks on this question of memorizing 
vs. thinking: 

There are few teachers who do not admit that the objective tests are 
better as measuring devices, but some teachers fear that the objective tests 
are pedagogically unsound, and that they will tend to mechanize teaching 
and produce what is called “dead uniformity.” Specifically, it is feared by 
some that the objective passive vocabulary tests will cause students, aided 
and abetted by their teachers, to “memorize mere lists of words.” Thus it is 
feared that students might make a serious breach in future objective tests 
by the simple and effective device of memorizing the small matter of the 
two or three thousand most frequently used words in each language! The 
reader can judge for himself whether such a result would be unpedagogical 
or calamitous, or an unhoped-for blessing. It may well be that the ob- 
jective vocabulary test may produce such a miracle; but the proponents of 
new-type tests have never been optimistic enough to hope that this charge 
against their tests would turn out to be true.¹ 

There are a number of similarly penetrating refutations of 
current prejudices against the new-type examination in this 
recent monograph by Wood. The teacher of languages, 
especially, will wish to read the whole discussion. The 
notion that pupils will study isolated facts as a preparation 
for objective tests is a matter closely related to the thought 
question since it is the type of question (not its subjective 
or objective form) which determines the attitude of the pupil. 

Another angle of the thought question has ordinarily 
escaped attention, viz., that once a situation has been used 
for the purposes of evoking thought, it is almost valueless 
(as a thought question) for the same pupils at any later time. 
At a future date the original thought question will be an- 
swered mostly from memory. A thought question must 
thus be an “original” in the sense that facts common to all 
pupils must be organized into a reaction to a novel situation. 
Certain it is that in most cases we cannot be sure of what 
is thought and what is memory, either from inspection of 
the question or from scrutiny of the pupil’s reply. 

Returning to the two versions of the question in Roman 
history, what right has any one to say in advance of actual 
experimentation that one version is any more conducive to 
the evil of cramming than the other? 

The framing of thought questions is an art, and a rare 
art so far as the author’s experience goes.² Thought 
questions do exist; certainly they are valuable; but how 
can we guarantee that they evoke thought simply because 
they are expressed in a form which makes a thoughtful re- 
sponse possible? 

¹ B. D. Wood, New York Experiments with New-Type Modern Language Tests (New 
York: The Macmillan Co., 1927), pp. 96-97. Quoted by permission of the Macmillan Co. 

² The author has for years been using as an assignment in certain classes the preparation 
of ten thought questions. Fully two-thirds (usually more) of the questions submitted by 
classroom teachers bear little or no resemblance to genuine thought questions. Typical 
examples are: “What are the products of New England?” “What causes fainting?” “How 
does a siphon work?” “Why did Columbus think the world was round?” “Why is tobacco 
injurious to children?” etc. 

Psychologically there is a considerable degree of interrela- 
tionship between memories and thoughts, between ideas and 
facts. We reason with facts. We cannot reason correctly 
without them. Let us admit that there exists a problem 
relative to the framing and use of thought questions and 
refrain from assuming, a priori, that the traditional examina- 
tion occupies a favored position in this respect. It may 
be true, but there is no proof for such a belief. It is both 
more scientific and more stimulating to assume that the 
new-type examination has great possibilities in this direc- 
tion, if for no other reason than its greater objectivity. It is 
futile to hold to a form of question, whether it be thought- or 
fact-provoking, if it cannot be evaluated fairly. 

Objective examinations require mimeographing or print- 
ing. This is a practical objection which must be admitted. 
Some kinds of objective tests (the true-false, particularly) 
may be read aloud to the pupils. This may work fairly well 
with older children, but it is always a second-rate procedure. 
Mimeographing takes time, and stencils cost considerable 
money.¹ Paper costs are roughly the same, as the answers 
must be written in any case. If there is any decided superi- 
ority of the new-type examination, school budgets must be 
made to carry the added expense. The total costs appear 
infinitesimal in comparison with modern school expenditures 
for (often) less essential (but perhaps more ornamental) 
equipment than mimeographs and mimeograph paper. 

Objective tests are unnatural and unpedagogical. This 
criticism is another which falls glibly from the tongue of the 
critic. Wood has already disposed of one angle of the 
matter. The author would like to raise the question as to 
what is the “natural” or “pedagogical” form of questioning. 

¹ Some teachers find a hectograph effective and relatively inexpensive in making copies 
of objective tests. Forty or fifty legible copies can be made from a single stencil. 

It would be most instructive and valuable to study the 
exact forms in which school-acquired knowledge is used in 
actual life. These modes would then be our answer as to 
what are the natural or pedagogical forms of questions. 
As a rough substitute for such experimental investigations, 
the author has made a conscious effort to observe the ways 
in which adults use their concepts and information. One 
fruitful source of hints has been the conversations in the 
smoking compartments of Pullman cars. Thus far nothing 
has been found which resembles closely any variety of school 
question-and-answer with the possible exception of a sort 
of true-false statement. Certainly no person after leaving 
the public schools has ever been called upon to write or to 
recite answers to such a question as “Give the causes of the 
Revolutionary war” or “Explain in full the digestion of 
carbohydrates.” We do hear adults make statements to 
the effect that “the League of Nations is a flat failure” or 
that “the Monroe Doctrine is nothing more nor less than a 
'chip on America’s shoulder.’ ” To such an assertion (in 
the smoking compartment) there is speedy and dogmatic 
challenging, attempted support and refutation, and a final 
settling down to argument. The mental set in the ordinary 
political, economic, or social argument is as near to the true- 
false attitude as to anything which goes on in school. 
Whether it be the riding qualities of the new Ford, the pay- 
ment of European debts, or the batting of Babe Ruth, noth- 
ing closely akin to the question of the traditional examina- 
tion emerges. 

We are densely ignorant of what is “natural” as a type of 
examination question. If naturalness is a prime desidera- 
tum of examining, the solution must come from an analysis 
of adult needs and usages; not from assumptions that the 
old is natural and pedagogical, but that the new is lacking in 
such values. 

As has been stressed before, the examination should be 
nothing more nor less than a sampling of the kinds of activities 
which go on daily in the classroom. It would be most dis- 
appointing to believe that the traditional written question- 
and-answer represented the heights to which pedagogy has 
ascended to date. No type of examination is very natural 
in the sense of social utility. The “best” examination must 
be, for the present at least, that one which by experiment 
can be shown to be the most valid and reliable. Such 
criteria can be made to rest upon a far more sound basis 
than can the cries of “mechanization,” “unpedagogical,” 
“dead uniformity,” “pure memorization,” etc. It is a queer 
sort of teacher who can instruct pupils nineteen days out of 
a month in such a way as to avoid Scylla, but on the twen- 
tieth, or examination day, steer directly at Charybdis. 

Objective tests measure recognition rather than spon- 
taneous recall. Many types of objective tests call for 
selection among stated alternatives. The traditional ex- 
amination called for spontaneous recall. This difference 
has been deemed a superiority of the older examination. 
The question is a difficult one to decide. It is something 
like the issue discussed in the preceding section. We do 
not know what knowledge should always be at our finger 
tips and what knowledge will suffice if it enables us to select 
truth from error when both are presented. We probably 
need both kinds of knowledge. The author prefers to leave 
this question an open one in the mind of the reader. Certain 
of the objective-test devices (the simple-recall and the 
completion) resemble greatly the traditional examination. 
Others (true-false, multiple-choice, matching, etc.) are 
recognitive or selective operations. Very little of the ac- 
quisitions of the schoolroom need be retained in a form 
capable of immediate recall. For many purposes it is 
sufficient that the right facts, ideas, and concepts “come to 
mind” when choice between alternatives is presented. On 
the whole, there seems to be no certain knowledge upon 
which a just criticism of objective examinations can be 
made on the score of recognition vs. recall. Here again is 
a promising, although very difficult, field for investigation. 



CHAPTER V 


STUDENTS’ ATTITUDES TOWARD 
EXAMINATIONS 

Introductory statement. It is probably agreed that some 
thought should be given to the attitudes of both pupils and 
teachers toward examinations. Criticisms from either 
source may at times suggest changes in methods of educa- 
tional measurement, provided, of course, that such criticisms 
are constructive in character. Both teachers and students 
may be expected to be biased on the issue of examinations vs. 
no examinations, and these biases may run in opposite 
directions. Pupils often dislike examinations merely be- 
cause they represent irksome labor. If the examination or 
test is viewed as part of the educational process, we need 
not abandon it because of lack of popularity with pupils. 
Changing a punctured tire is a nuisance and extremely un- 
pleasant, yet few would conclude that the solution is the 
abandonment of automobile transportation, or even the 
running of cars on a flat tire. Most persons agree that the 
tire should be fixed regardless of transitory feelings. Exam- 
inations can certainly be made more pleasurable except for 
the minority who resent any “showdown” on their knowledge 
or lack of knowledge. 

This chapter will present brief abstracts of several studies 
of students’ attitudes toward examinations, together with a 
short summary of these investigations. No space will or 
need be devoted to the non-experimental literature (and it is 
somewhat voluminous) on the continuance or rejection of 
the examination system. The ipse dixit arguments are 
principally based upon the type of examination which 
appears to be passing in large measure. 

Somers’s study of the attitudes of students toward ex- 
aminations. Somers¹ has verified the common observation 
that pupils generally have marked aversion to the taking of 
both oral and written examinations. He asked forty-five 
teachers and 163 college students to rank their attitudes on 
eighteen activities, examinations included, on three scales, 
as follows: pleasant-unpleasant, valuable-worthless, and mor- 
al-immoral. The attitudes of students and teachers are 
surprisingly similar. 

TABLE 18 

Results of Somers and Gallagher and Ruch for Eighteen Activities 
on a Scale of Pleasant-Unpleasant 

                                                (1)           (2)           (3) 
                                             Rank Given    Rank Given    Rank Given 
Activity                                     by Teachers   by Students   by Students 
                                              (Somers)      (Somers)     (Gallagher 
                                                                          and Ruch) 

Attending a concert                              1             2             5 
Reading a story                                  2             1             2 
Attending a movie                                3             4             4 
Witnessing a basketball game                     4             3             3 
Going on a field trip or excursion               5             7             9 
Attending classes                                6             6             7 
Going to a circus                                7             9             6 
Attending a convocation or assembly              8             8             8 
Laboratory experiments                           9            11            12 
Writing a story                                 10            10            11 
Attending a dance                               11             5             1 
Cleaning your room                              12            12            10 
Washing dishes                                  13            13            14 
Writing a theme or composition                  14            14            13 
Taking an examination (state or discuss type)   15            16            15 
Taking an oral quiz                             16            15            16 
Pulling weeds                                   17            17            17 
Digging a ditch                                 18            18            18 

Numbers                                         45           163           166 

The author repeated the major part of Somers’s study 
upon a group of 166 students and teachers, about one-fifth 
of the group being experienced teachers. The students 
comprised juniors, seniors, and graduates in roughly equal 
numbers. Mr. Edward D. Gallagher summarized these 
data, using the procedure which Somers had employed in 
his study. 

¹ Grover T. Somers, “Students’ Attitude Toward Examinations,” Bulletin of the School 
of Education, Indiana University, Vol. III, No. X (September, 1926), pp. 1-48. 

Table 18 shows the ratings upon the Pleasant-Unpleasant 
scale. Columns (1) and (2) give Somers’s findings and 
column (3) shows the results obtained by Gallagher and 
Ruch. 

Teachers and students seem agreed that the taking of 
examinations is little more satisfying than weed-pulling and 
ditch-digging! 

The correlations between the rankings given in the three 
columns are all reasonably high, as follows: (1) r12 = 0.94; 
(2) r13 = 0.84; (3) r23 = 0.96. 

A second part of Somers’s investigation showed, however, 
that a better attitude toward examinations (at least those 
of the objective type) resulted from a semester’s use of 
examinations as an integral and constructive part of in- 
struction. The increase in favorableness of attitude, al- 
though not so great as we might wish, was nevertheless 
significant. The final rankings of the eighteen activities 
in Somers’s experimental groups gave the various types 
of examinations positions as follows: true-false test, 10; 
matching test, 11; multiple-choice test, 12; oral quiz, 14; 
and the written examination and completion test tied for 
15th and 16th ranks. 

One final point should be noted about Somers’s study. 
The correlations between students’ and teachers’ attitudes 
were surprisingly high except in one case: the purposes and 
functions of written examinations. This discrepancy will 
be made clear from the data presented below. 

Scale                                Correlation 

Pleasant-unpleasant                     0.94 
Moral-immoral                           0.99 
Valuable-worthless                      0.91 
Purpose and function of: 
    Recitations                         0.95 
    Oral quizzes                        0.85 
    Written examinations                0.39¹ (0.61) 


It seems to be true that students and teachers differ greatly 
as to their notions of the purposes and functions of written 
examinations. The rank-orders are given in Table 19 for 
Somers’s study and its repetition by Gallagher and Ruch. 


TABLE 19 

Purposes and Functions of Written Examinations 

                                              (1)           (2)           (3) 
                                           Rank Given    Rank Given    Rank Given 
Purpose or Function                        by Teachers   by Students   by Students 
                                            (Somers)      (Somers)     (Gallagher 
                                                                        and Ruch) 

Means of measuring ability to think            1             3             3 
Basis for comparing achievement                2             4             5 
Basis for measuring students' knowledge        3             1             1 
Means of stimulating reviewing                 4             8             2 
Means of stimulating studying                  5             7             4 
Opportunity for self-expression, etc.          6             9             9 
Means of measuring results of teaching         7             6             7 
Means of stressing important facts             8             2             6 
Basis for term marks                           9             5             8 
Means of measuring memory ability             10            10            10 

Numbers                                       45           163           177 


The intercorrelations of the three columns of Table 19 are: 
r12 = 0.61 (0.39 as given by Somers); r13 = 0.78; r23 = 0.56. 
All in all, Somers’s conclusions that teachers and pupils 
hold substantially the same views on the pleasurableness of 
examinations but differ widely on the functions of examina- 
tions seem to be valid. 

¹ This value is in error as given by Somers. According to the author's calculations it 
should be 0.61. 

Hughes’s study of students’ attitudes toward examina- 
tions. Hughes,¹ in an investigation somewhat along the 
lines of that of Somers, reaches rather different conclusions. 
He gave to classes in “Problems of Democracy” examinations 
made up of the following parts: 

Part I.   Limited recall (controlled answers but not completely objective) 
Part II.  Specific definition or explanation (i.e., state meaning, define, or explain laws, terms, etc.) 
Part III. Completion exercises 
Part IV.  Multiple-response items 
Part V.   True-false (called by Hughes “Alternate-Response”) 
Part VI.  Essay examination 

He asked 157 pupils to record their attitudes toward these 
different types of questions upon a scale of seven points. 
The results are shown in Table 20. 

Point A shows that the essay test suffers badly in compari- 
son with the newer examination so far as pupils’ enjoyment 
of the examination period is concerned. Points B, C, and G, 
which deal with the measurement aspects of examinations, 
indicate that the objective tests inspire greater confidence 
as to their justice and accuracy; it is further to be noted 
that the combination of types is most to be desired. The 
essay examination is almost “out of the running” on these 
points. The results on Point G (which is always dear to the 
hearts of pupils) are particularly significant. Perhaps the 
worst showing of the essay examination is on Point D, where 
strong feeling is evidenced that the essay examination en- 
courages pure rote memorization of materials. Points E 
and F are not commented on, as Hughes felt that the pupils 
did not fully understand the significance of these two issues, 
thus giving rise to apparent contradictions with Point G. 


¹ Unpublished Master's thesis, University of Pittsburgh. For a résumé, see C. A. 
Buckner and R. O. Hughes, “Testing Results in the Social Studies,” Journal of the School 
of Education, University of Pittsburgh, Vol. I, No. 1 (Sept.-Oct., 1925), pp. ^11. 



TABLE 20 

Evaluation of Attitudes of 157 Pupils Toward Different Types of Examination 

Brinkley’s findings. Brinkley summarized his study of 
the preference of pupils for various types of examinations 
in the following table.¹ 


TABLE 21 

Pupil Preferences in Regard to Type of Examination 

Old Type vs. New Type               No. of Pupils    Per Cent 

Essay Only                                22             16 
Essay and New Type Combined               67             51 
New Type Only                             43             33 
Total                                    132            100 

Choice of New-Type                  No. of Pupils    Per Cent 

True-false                                20             20 
Multiple-choice                           71             70 
Completion                                 1              1 
Word or Phrase Answer                      5              5 
Arrangement                                4              4 
Total                                    101            100 


Other studies. Several minor studies are brought to- 
gether here as a tabulation: 


Author        No. of Students    Ratios of Preference for New-Type to Old-Type 

Bardy*             242           95:5 
Kinder†            200+          97:3 
Kolstoe‡           300           89:11** 
May††              260           (Only two objected to new-type.) 

* J. Bardy, “An Investigation of the Written Examination as a Measure of Achievement 
with Particular Reference to General Science” (1923), University of Pennsylvania. 

† J. S. Kinder, “Supplementing Our Examinations,” Education, Vol. XLV (1925), pp. 
557-566. 

‡ S. O. Kolstoe, “Reactions to True-False Tests,” School of Education Record, University 
of North Dakota, Vol. XI (1926), pp. 54-55. 

** About one-ninth preferred essay-type, and about one-fourth opposed the true-false. 

†† M. A. May, “Measuring Achievement in Elementary Psychology and in Other College 
Subjects,” School and Society, Vol. XVII (1923), pp. 472-476 and 556-5^. 

¹ G. Brinkley, Values of New-Type Examinations in the High School (New York, 
Columbia University, Teachers College Contributions to Education, No. 161, 1924), p. 100. 

Summary. The following comments will summarize the 
studies abstracted in this chapter: 

1. Somers has shown the relative unpopularity of exam- 
inations in comparison with other collegiate activities. 

2. Gallagher’s work seems to verify Somers’s general 
findings. 

3. Teachers and students seem to be agreed that exam- 
inations are relatively unpleasant tasks. 

4. Somers found that, as a result of experience, the new- 
type tests were at least slightly more favorably received than 
were the traditional examinations. 

5. Somers found that teachers and students differed 
widely in their views on the functions of examinations. 
Gallagher’s results were similar. 

6. Hughes and Buckner found positive evidence that 
students think the objective examination superior to the 
essay-type. 

7. Brinkley’s results indicate that a combination of old- 
and new-type examinations is the most desirable practice. 
This seems to be a reasonable conclusion. 

8. Bardy, Kinder, Kolstoe, and May find that prefer- 
ence for new vs. old types runs in the ratio of about nine 
to one. 

9. There are sufficient disagreements in the matter of 
pupils’ attitudes to warrant further study. 

10. If tests and examinations are unpopular, does this not 
indicate the need for an attempt to integrate such measure- 
ments into the complete plan of instruction? Perhaps 
examinations could be made less irksome by making them 
more diagnostic and helpful to the student. It appears that 
examinations might be made to offer greater incentives. 



CHAPTER VI 


RELATIVE VALUES OF STANDARDIZED AND 
NON-STANDARDIZED TESTS 

Terms and definitions. Our vocabularies of educational 
measurement have not yet reached the point where all 
authors use terms with the same meanings. Such a com- 
mon expression as a standard or standardized test has little 
meaning. As the name implies, such a test is provided with 
norms or standards of achievement. 

If the provision for norms constitutes the sole difference 
between the informal (unstandardized) objective examina- 
tion (new-type test) and the so-called standard test, then 
the latter is nothing more than an objective or semi-objective 
examination with norms. Many well-known standard tests 
are in fact fairly described as more-or-less objective ex- 
aminations with norms. 

But a genuine standard test must meet far more stringent 
requirements than mere possession of norms. In truth, an 
otherwise well-constructed standard test, with no norms at 
all, would fulfill most of the functions of a standard test. 
Following the practice of most competent standard test- 
makers, a standard test will be defined for present purposes 
as one which: 

1. Has demonstrated validity resting upon some more 
secure basis than personal opinion. We have already listed 
(Chapter II) the principal methods of validating tests. It is 
further assumed that each individual item has been validated 
separately. 

2. Has demonstrated reliability. The principal means of 
assuring reliability to a test have already been described 
(Chapter II). We should be able to assume that a sufficient 
experimental selection and elimination of “dead timber” 
has been carried out to guarantee a reasonable approxima- 
tion to the situation of having the reliability as high as is 
possible in view of the number of test items which may ac- 
tually be included in the working time allowed. Allowances 
should be made, of course, for such factors as differences in 
subject-matter, relative economies and reliabilities of dif- 
ferent types of test items, etc. 

As a corollary of this point, it should be fair to insist that 
a standard test should not be placed upon the market if its 
reliability is so low that its uncritical acceptance by purchas- 
ers and users will result in gross mis-measurement of pupils. 
Unfortunately there are not a few standard tests, held rather 
generally in high repute, which yield reliabilities far below 
those ordinarily to be obtained by a thirty- to fifty-minute 
informal objective classroom test of the unstandardized 
variety. The use of such a test is probably a sheer waste 
of time and money in spite of the availability of norms. 
(See Tables 22 and 23, pages 143 and 144.) 

3. Has a reasonable degree of objectivity of scoring, in 
order that subjectivity will not react upon the reliability 
and consequently the validity of the test. 

4. Has norms or standards for evaluating the results 
obtained by the test. This requirement is less essential 
than the preceding ones, and much less essential than popular 
opinion suggests. Norms are uncertain quantities when we 
consider the enormous differences likely to be found in the 
same school subject in such situations as: rural vs. city 
schools, differing courses of study, differing textbooks, 
differing methods of instruction, differing mental abilities 
of pupils, etc. Any norm is at best a mythical entity like the 
“average man,” the “typical pupil,” etc. 

Dr. A. S. Otis has drawn up a suggestive rating scale for 
evaluating standard tests, which helps to define our concept 
of what constitutes the genuine standard test.¹ This is 
given on the next page. In the opinion of the present writer 
Otis’s scale places too little weight upon validity and re- 
liability in comparison with administrative convenience. 
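The balance of weights to which the present writer objects can be tallied from the point values printed on Otis's scale. The Python sketch below does this. The grouping of headings into "technical" and "convenience" is an assumption made here for illustration; Otis's scale itself makes no such division.

```python
# Maximum points for each main heading of Otis's "Scale for Rating Tests".
OTIS_WEIGHTS = {
    "Manual": 5,
    "Validity": 15,
    "Reliability": 10,
    "Reputation": 5,
    "Ease of Administration": 15,
    "Ease of Scoring": 15,
    "Ease of Interpretation": 15,
    "Convenient Packages": 5,
    "Typography and Makeup": 5,
    "Test Service": 10,
}

# Assumed grouping, for illustration only (not part of the scale).
TECHNICAL = ("Validity", "Reliability")
CONVENIENCE = (
    "Ease of Administration",
    "Ease of Scoring",
    "Ease of Interpretation",
    "Convenient Packages",
)


def share(headings):
    """Fraction of the total points carried by the given headings."""
    total = sum(OTIS_WEIGHTS.values())
    return sum(OTIS_WEIGHTS[h] for h in headings) / total


technical_share = share(TECHNICAL)
convenience_share = share(CONVENIENCE)
print(f"validity + reliability: {technical_share:.0%}")
print(f"convenience headings:   {convenience_share:.0%}")
```

Under this grouping, validity and reliability together carry one quarter of the points, while the convenience headings carry one half.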

For a further discussion of the criteria of a good standard 
test, the reader is referred to Ruch and Stoddard, Tests and 
Measurements in High School Instruction, especially pages 
45 to 68 and 301 to 375. 

How valid and reliable are most standard tests? It is 
very difficult to evaluate standard tests with respect to 
their general validity. In some cases their validity is much 
higher than the classroom teacher can hope to attain with 
informal objective tests. In general, however, the validity 
of most standard tests is open to discussion when we con- 
sider that local conditions vary so greatly. Against the 
fact that the standard test must fit a wide variety of local 
situations, we can set the fact that a well-made standard 
test represents a much more highly refined product than is 
possible with the test constructed without elaborate experi- 
mentation by a classroom teacher. It is probably fair to 
assert that a minority of the best standard tests more than 
compensate for their lack of perfect conformity to local 
conditions, but that the rank-and-file may be viewed with 
considerable suspicion. It is the conviction of the author 
that those schools employing rather extensive standard 
testing programs should seek to supplement their educational 
measurements by locally constructed objective or new-type 
tests. Such schools form the minority; the larger number 
may well afford to adopt a systematic program of testing by 
standardized measures. 

¹Test Service Bulletin, No. 13, “Scale for Rating Tests” (Yonkers-on-Hudson: World 
Book Company, 1926). 

SCALE FOR RATING TESTS 

Names of Tests: ____________________ 

Manual (5) 
Validity (15) 
Reliability (10) 
Reputation (5) 
Ease of Administration (Total 15) 
    (a) Preparation (4) 
    (b) Time limits (4) 
    (c) Explanation needed (3) 
    (d) Alternative forms (4) 
Ease of Scoring (Total 15) 
    (a) Objectivity (10) 
    (b) Time required (3) 
    (c) Simplicity (2) 
Ease of Interpretation (Total 15) 
    (a) Norms (5) 

    (c) Class record (1) 
    (d) Application of results (5) 
Convenient Packages (5) 
Typography and Makeup (5) 
Test Service (10) 
Total (100) 

Considerably more important than an extension of 
standard testing in the United States is a critical re-examination 
of the measures employed at present. There are a 
great many tests of excellent popular repute that have been 
outgrown by the progress in educational measurement, 
and these should be abandoned in favor of more recent and 
more highly perfected measures. It is, of course, manifestly 
unwise to refer to such tests by name, although recent writers 
on educational measurement have published experimental 
evidence sufficient for the critical selection of adequate 
testing materials.² 

Recent books, such as those cited in the footnote, have 
shown the courage to present the advantages and limitations 
of specific tests in a critical fashion based upon actual use 
and statistical study. 

Monroe was the first to bring together any considerable 
amount of data on the reliabilities of well-known tests.³ 
Table 22 gives his results, the average and median being 
inserted by the present author. The twenty-one well-known 
tests show an average reliability of between 0.67 
and 0.75, depending upon whether the average or median is 
chosen. Such a central tendency is surely disappointing. 
It should be pointed out that Monroe’s list is far from a 
random sampling of tests, his list covering chiefly reading 
and arithmetic tests (many of them constructed some years 
ago), and the selections largely confined to publications of a 
single company. Moreover, reading tests are very difficult 
of construction. Also, with a few notable exceptions (standing 
near the top of the list), these tests are very short ones 
requiring from five to fifteen minutes’ time. 
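The two measures of central tendency quoted for Table 22 can be checked directly from the twenty-one coefficients. The following Python sketch recomputes them; the list simply transcribes the coefficient column of the table, and the variable names are chosen here for illustration.

```python
from statistics import mean, median

# The twenty-one reliability coefficients of Table 22, in the order printed.
coefficients = [
    0.92, 0.87, 0.86, 0.85, 0.84, 0.84, 0.80,
    0.77, 0.76, 0.76, 0.75, 0.72, 0.72, 0.66,
    0.62, 0.59, 0.58, 0.37, 0.36, 0.33, 0.19,
]

average = mean(coefficients)   # arithmetic mean of the coefficients
middle = median(coefficients)  # the eleventh of the twenty-one ranked values

print(f"average = {average:.2f}, median = {middle:.2f}")
```

The median exceeds the average because a few very low coefficients at the bottom of the list pull the average down.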

The author has gathered from his files 149 of his computations 
of reliability coefficients for standard tests (Table 23). 
Except for about a dozen determinations the tests are high-school 
and junior high-school tests, the latter being confined 
principally to the fields of geography, history, reading, and 
arithmetic. Such tests are also used in grades two to six 
in many instances. Table 23 presents these findings in a 
series of columns which segregate the tests into working 
times of varying lengths.* 

²T. L. Kelley, Interpretation of Educational Measurements (Yonkers-on-Hudson: 
World Book Co., 1927), especially pages 214-348. 

P. M. Symonds, Measurement in Secondary Education (New York: The Macmillan 
Co., 1927). 

G. M. Ruch and G. D. Stoddard, Tests and Measurements in High School Instruction 
(Yonkers-on-Hudson: World Book Co., 1927). 

³W. S. Monroe, J. C. DeVoss and F. J. Kelly, Educational Tests and Measurements 
(Rev. ed., Boston: Houghton Mifflin Co., 1924), p. 42. 


TABLE 22 

Reliability Coefficients of Standardized Educational Tests 

TEST                                                              COEFFICIENT 

Illinois Intelligence General Intelligence Scale                      .92 
Courtis Standard Research Tests, Series                               .87 
Brown Silent Reading Test, Rate                                       .86 
Courtis Silent Reading Test No. 2, Rate                               .85 
Otis Group Intelligence Scale                                         .84 
Monroe Standardized Silent Reading Test, Revised, Rate                .84 
Courtis Silent Reading Test No. 2, Comprehension, No. Quest.          .80 
Starch Silent Reading Test, Comprehension, Words                      .77 
Monroe General Survey Scale in Arithmetic                             .76 
Monroe Standardized Silent Reading Test, Revised, Comprehension       .76 
Monroe Standardized Silent Reading Test, Revised, Rate                .75 
Monroe Standardized Silent Reading Test, Revised, Comprehension       .72 
Starch Silent Reading Test, Comprehension, Ideas                      .72 
Indiana Attainment Scale No. 1                                        .66 
Starch Silent Reading Test, Rate                                      .62 
Pressey Primer Scale                                                  .59 
Courtis Silent Reading Test No. 2, Comprehension, Index               .58 
Pressey First Grade Vocabulary Scale                                  .37 
Brown Silent Reading Test, Comprehension, Quantity                    .36 
Pressey Primer Scale                                                  .33 
Brown Silent Reading Test, Comprehension, Quality                     .19 

Average = .67 
Median = .75 

Walter S. Monroe, The Illinois Examination, p. 47. University of Illinois Bulletin, 
Vol. 19, No. 9, Bureau of Educational Research Bulletin, No. 6. (Urbana: University of 
Illinois, 1921.) 

L. W. Pressey, “A Group Scale of Intelligence for Use in the First Three Grades: Its 
Validity and Reliability,” Journal of Educational Research (April, 1920), Vol. I, pp. 28^94. 

Unpublished data of the Bureau of Educational Research, University of Illinois. 

S. S. Colvin, “Some Recent Results Obtained from the Otis Group Intelligence Scale,” 
Journal of Educational Research (January, 1921), Vol. III, pp. 1-12. 
TABLE 23 

Relation Between Reliability of Standard Tests and Their 
Working-Time Limits 

(These data principally appear in Ruch and Stoddard, Tests and Measurements 
in High School Instruction.) 

                      Working Time Limits (in minutes) 
r                0-9 10-19 20-29 30-39 40-49 50-59  60-119  120-179  180-239  Total  Avg. of Row 
.95-.99            -     -     -     -     2     -       2        1        -      5         83.5 
.90-.94            -     1     7     -     4     -       -        -        1     13         44.1 
.85-.89            -     1     6     3     5     2       -        -        -     17         35.1 
.80-.84            -     3     4     6     7     -       1        -        -     21         35.7 
.75-.79            2     2     3     4     -     -       -        -        -     11         22.7 
.70-.74            4     4     2     3     -     1       -        -        -     14         20.2 
.65-.69            4     4     2     1     -     -       -        -        -     11         14.5 
.60-.64            5     1     2     3     -     -       -        -        -     11         15.4 
.55-.59            4     3     -     4     1     -       -        -        -     12         20.3 
.50-.54            6     2     3     3     -     1       -        -        -     15         19.2 
.45-.49            3     -     -     1     -     1       -        -        -      5         20.5 
.40-.44            2     2     -     -     -     -       -        -        -      4          9.5 
.35-.39            3     -     -     1     -     -       -        -        -      4         12.0 
.30-.34            3     -     -     -     -     -       -        -        -      3          4.5 
.25-.29            -     -     1     -     -     -       -        -        -      1         24.5 
.20-.24            2     -     -     -     -     -       -        -        -      2          4.5 
Total             38    23    30    29    19     5       3        1        1    149 
Avg. of Column   .55   .68   .77   .69   .86   .69     .92      .97      .92   .694 

It is apparent at once that the short tests (less than 30 or 
40 minutes) are low in reliability as a rule, although there 
are many exceptions. The general correspondence with 
Monroe’s table is striking. Unlike Monroe’s table, Table 23 
includes a number of determinations of the reliability of a 
long “battery” of tests (the Stanford Achievement Test). 
If results from this test (somewhat more than two hours long) 
were excluded, the agreement of averages would have been 
almost perfect. 
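The marginal averages of Table 23 can be approximated from the cell frequencies. The Python sketch below assumes that every determination sits at the midpoint of its class, both for reliability (.97 for the .95-.99 class, and so on) and for working time (4.5 minutes for the 0-9 class, and so on). The source does not say how the printed averages were computed, so the midpoint rule and the function names are assumptions made here, and results recomputed this way need not agree with the printed margins to the last figure.

```python
# Assumed midpoints of the nine working-time classes, in minutes.
TIME_MIDPOINTS = [4.5, 14.5, 24.5, 34.5, 44.5, 54.5, 89.5, 149.5, 209.5]

# Frequencies read from Table 23.
# Key: assumed midpoint of the reliability class.
# Value: number of determinations in each working-time class.
TABLE_23 = {
    0.97: [0, 0, 0, 0, 2, 0, 2, 1, 0],
    0.92: [0, 1, 7, 0, 4, 0, 0, 0, 1],
    0.87: [0, 1, 6, 3, 5, 2, 0, 0, 0],
    0.82: [0, 3, 4, 6, 7, 0, 1, 0, 0],
    0.77: [2, 2, 3, 4, 0, 0, 0, 0, 0],
    0.72: [4, 4, 2, 3, 0, 1, 0, 0, 0],
    0.67: [4, 4, 2, 1, 0, 0, 0, 0, 0],
    0.62: [5, 1, 2, 3, 0, 0, 0, 0, 0],
    0.57: [4, 3, 0, 4, 1, 0, 0, 0, 0],
    0.52: [6, 2, 3, 3, 0, 1, 0, 0, 0],
    0.47: [3, 0, 0, 1, 0, 1, 0, 0, 0],
    0.42: [2, 2, 0, 0, 0, 0, 0, 0, 0],
    0.37: [3, 0, 0, 1, 0, 0, 0, 0, 0],
    0.32: [3, 0, 0, 0, 0, 0, 0, 0, 0],
    0.27: [0, 0, 1, 0, 0, 0, 0, 0, 0],
    0.22: [2, 0, 0, 0, 0, 0, 0, 0, 0],
}


def column_totals():
    """Number of determinations in each working-time class."""
    return [sum(row[j] for row in TABLE_23.values())
            for j in range(len(TIME_MIDPOINTS))]


def average_reliability(column):
    """Frequency-weighted mean reliability for one working-time class."""
    n = sum(row[column] for row in TABLE_23.values())
    return sum(r * row[column] for r, row in TABLE_23.items()) / n


def average_working_time(r):
    """Frequency-weighted mean working time for one reliability class."""
    row = TABLE_23[r]
    return sum(t * f for t, f in zip(TIME_MIDPOINTS, row)) / sum(row)


print(column_totals())
print(round(average_working_time(0.97), 1))  # top row of the table
print(round(average_reliability(2), 2))      # the 20-29 minute column
```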

The results of Monroe and the author show rather con- 
clusively that short standard tests, with some exceptions, defeat 
one of their principal purposes, viz., reliable measurement. 
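One way to gauge what is lost by brevity is the Spearman-Brown relation, which predicts the reliability of a test lengthened k-fold with comparable material. The formula is not invoked in this passage, so the Python sketch below is offered only as an illustration under that assumption. The starting value of .55 is the column average for the 0-9 minute tests in Table 23.

```python
def spearman_brown(r, k):
    """Predicted reliability when a test of reliability r is made k times as long.

    Assumes the added material is comparable to the original (the usual
    Spearman-Brown condition).
    """
    return k * r / (1 + (k - 1) * r)


short_test = 0.55  # average reliability of the 0-9 minute tests (Table 23)
for k in (1, 2, 4):
    print(k, round(spearman_brown(short_test, k), 2))
```

On this assumption a short test of reliability .55 would need to be made roughly four times as long to reach a reliability in the neighborhood of .83.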
Standard and informal objective tests compared and 
contrasted. The foregoing tables and discussion must not 
be interpreted as a disavowal of standard tests and testing. 
On the contrary, the author is a firm believer in the values 
of the standard test. If present remarks are construed as 
ultra-critical, and there is danger of this happening, the de- 
fense is that we have been making a plea for continued, and 
extended, use of standard tests which have been selected 
critically. There is no denying that the impersonal, national 
and normative nature of the standard test gives it a unique 
position in educational practices. It is unavoidable that 
the standard test can never hope to parallel all educational 
conditions until such time as, if ever, there shall be reasonable 
uniformity of practice. In the long run we shall come to 
know more and more definitely what elements in our curri- 
cula prepare for adult activities, pleasures, and outlooks. To 
the extent that this aim comes to be realized, to that extent 
and to that extent only can highly valid standard tests be 
constructed. Reliability of measurement, on the other hand, 
waits upon no such final determinations. We can secure 
reasonable approaches to accurate measurement at the 
present time by means of the application of the current 
techniques of test construction. One thing is certain, viz., 
that we face a decision between continuing to use the five- 
to fifteen-minute test, with its resulting and certain limited 
reliability, and the adoption of more time-consuming 
measures, standardized or informal. We must abandon the 
thoroughly untenable position that time spent in testing is 
time wasted in teaching. Teaching and testing are aspects 
of the same process. It is further beside the point to claim 
that standard testing is too expensive. There has never 
been a case in the history of education where worth-while 
practices have been in the long run viewed, accepted, or 
rejected upon the sole basis of cost. In fact, the most 
inexpensive practices have invariably proved to be the most 
costly. School budgets always prove sufficiently elastic to 
cover any costs which can be demonstrated to be profitable 
outlays of money. 

But, after all has been said and done, there are indisput- 
able limitations of the standard tests in the complete meas- 
urement program. They will always present some degree of 
alienation from the local educational situation. To this 
extent they will require supplementing on the part of the 
classroom teacher. The only solution of this matter seems 
to rest in the locally constructed test, objective or otherwise. 
We are not yet ready to abandon completely the traditional 
examination. It has its undeniable place. We must, however, 
recognize its limitations, and continue also to point 
out the shortcomings of informal and standardized objective 
tests. It is doubtful whether standard tests can be made 
sufficiently detailed to provide constant diagnostic guidance 
to teacher and pupil when we consider the economic and 
commercial limitations imposed upon such measures. The 
immediate solution of our problem of educational measure- 
ment seems to lie in the combined and complementary use 
of standardized and locally-derived testing materials. 



Part II 

HOW TO CONSTRUCT AN OBJECTIVE 
EXAMINATION 




CHAPTER VII 


THE BUILDING OF AN OBJECTIVE TEST 
OR EXAMINATION 

Analysis of the job. The general order of operations in 
constructing an objective test may be listed as follows: 

I. Drawing up a Table of Specifications 

II. Drafting the items in preliminary form 

III. Deciding upon the scope (length) 

IV. Editing and selecting the final items 

V. Rating the items for difficulty 

VI. Breaking the items into alternative forms 

VII. Rearranging the items in order of difficulty 

VIII. Preparing the instructions for the test 

IX. Making the answer keys or stencils 

X. Deciding upon rules for scoring 

Before entering upon a detailed discussion of these ten 
steps or operations in building a test or examination, some 
general justification is needed for certain of the steps in- 
cluded and for their order of appearance. 

It will be noted that the decision as to the scope and 
length of the test is made the third step rather than the 
first, as is commonly the case. At first sight this seems 
illogical. However, the length (number of items) needed 
in a test cannot be decided a priori. It is only after the 
preliminary items have been written and the available 
number of good items has been ascertained, that it is possible 
to decide how many items are demanded for an adequate 
sampling of the subject-matter. It might be thought in 
advance that fifty items would cover a certain group of 
topics, but after the task of writing the items had been 
finished, it might be apparent that fifty items were too few 
to cover the ground thoroughly or that fifty worth-while 
items could not be constructed. For this reason it is 
recommended that the decision as to the exact length of 
the test be postponed until the items have been drafted 
in preliminary form. 

The sixth step is not absolutely essential, but it is a desirable 
extension of usual practice. The values of alternative 
forms for tests have been mentioned before. The advantages 
of duplicate forms for stabilizing grading purposes 
will be described in a later chapter. 

I. DRAWING UP A TABLE OF SPECIFICATIONS 

The term “Table of Specifications” was adopted for the 
sake of emphasizing the need for a general guide or skeleton 
in building a test. Such a table guards against the omission 
of essential items, the over-emphasis of minor topics, and 
improper balance of the sampling. The drawing-up of a 
working plan before drafting specific items goes a consider- 
able distance in establishing the validity of the final test 
when completed. 

The various steps in constructing an objective test may 
be introduced by an actual example. 

As a more or less typical example, a six-weeks’ history 
test over the period previous to the Revolution was chosen. 
The materials covered during the six weeks represented 
largely the first six chapters, pp. 1-125, in Beard and Bagley, 
The History of the American People. The Table of Specifica- 
tions which was drawn up is shown on the next page. 

Table of Specifications: The Pre-Revolutionary Period 

                                                          Key     Percentages 
No.   Topic                                               Letter  of Items 

I.    Early trade routes and commerce                       T        10% 
        (a) Transfer of center of commerce from the 
            Mediterranean to the Atlantic 
        (b) The trade with the Orient 
        (c) Marco Polo and his influence 
        (d) The early navigators 
        (e) The problem of a water route to the East 
        (f) Aid of various monarchs to exploration 
        (g) Approximate dates 

II.   Famous navigators and explorers                       N        10% 
        (a) Columbus; life and ideas 
        (b) Spain’s aid to Columbus 
        (c) The voyages of Columbus 
        (d) Vasco da Gama 
        (e) Amerigo Vespucci 
        (f) Magellan 
        (g) Cortes and Mexico 
        (h) Conquest of Peru by Pizarro 
        (i) Ponce de Leon and De Soto 
        (j) The French explorers 
        (k) Cabot, Drake, and the English explorers 
        (l) The Spanish Armada 

III.  European conditions which led to the desire to 
      explore and colonize                                  E        15% 

IV.   The colonization of America                           C        30% 

V.    The struggle of European nations for supremacy 
      in North America                                      S        20% 

VI.   Life in colonial America                              L        15% 

      Total                                                         100% 

Several points should be noted about this specimen table. 
The major topics have been numbered with Roman numerals. 
Each major topic is also given a key letter which stands 
as an abbreviation of the full topical statement. The key 
letters are to be used for identification in sorting the items 
after they have been placed on cards (to be described later). 
Each major topic is followed by a stated percentage. For 
example, Topic I (key letter, T) is to contribute approxi- 
mately 10 per cent of the total number of items in the final 
test. These percentages are not assigned arbitrarily, but 
represent the teacher’s careful judgment as to the propor- 
tionate values of the several major topics to be covered. The 
percentages are left as such, and no attempt is to be made at 
this time to change these into actual numbers of items. This 
change can be made more intelligently under the third step 
in the analysis at the opening of the chapter. The Table of 
Specifications has not been completed, as it was thought 
that enough detail was given to define the procedure. 
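When the length of the test is finally fixed, the percentages of the Table of Specifications must be turned into whole numbers of items. The Python sketch below shows one way this might be done. The largest-remainder rounding rule and the function name are assumptions introduced here for illustration; the text leaves the conversion to the teacher's judgment.

```python
# Key letters and percentages from the Table of Specifications.
SPECIFICATIONS = {"T": 10, "N": 10, "E": 15, "C": 30, "S": 20, "L": 15}


def allocate(total_items, percentages=SPECIFICATIONS):
    """Split total_items among the topics in proportion to their percentages.

    Fractions are settled by the largest-remainder rule (an assumed
    convention), so that the counts always add up to total_items.
    """
    exact = {key: total_items * pct / 100 for key, pct in percentages.items()}
    counts = {key: int(value) for key, value in exact.items()}
    leftover = total_items - sum(counts.values())
    # Give the remaining items to the topics with the largest fractions.
    by_fraction = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for key in by_fraction[:leftover]:
        counts[key] += 1
    return counts


print(allocate(100))
print(allocate(85))
```

For a form of 100 items the counts equal the percentages themselves; for other lengths the rounding rule decides which topics receive the odd items.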

It should be noted that the sub-topics are lettered (a), 
(b), (c), etc. If desired, the keying of test items can be 
carried still further; e. g., items dealing with the “transfer 
of center of commerce from the Mediterranean to the At- 
lantic” might be keyed as T (a). No attempt has been 
made to assign percentages to sub-topics; these are to serve 
as reminders. To carry very far the assignment of percentages 
would defeat its own purposes by resulting in an impractica- 
ble and inflexible scheme which could not be followed. The 
sub-topics can be used in thinking about the worth of test 
items by asking oneself such questions as: “How many 
good items can I make for this subtopic?” “If I can ask but 
one question on this, what one thing is most important?” 
“How does this compare with that in importance?” etc. 

The particular scheme outlined above is not presented as 
the best possible, but it has been used many times by the 
author and his students, both in informal and standard 
objective-test construction. It is recommended that some 
such table be drawn up as a part of the validation of any 
important test to be constructed. 

II. DRAFTING THE ITEMS IN PRELIMINARY FORM 

With the Table of Specifications at hand, the next step is 
that of writing down tentative test items. In doing this, 
little attention need be paid to the percentages. Take 
each topic and sub-topic in turn and write out items which 
cover the “high points” of each. Do not spend much time 
refining the wording. The important tasks just now are: 

1. Covering the field thoroughly but at the same time 
avoiding trivial points; and 

2. Deciding which objective technique or type (true-false, 
completion, multiple-choice, matching, etc.) is best suited to 
handling the particular question in mind. 

In the end it is far more economical of time and labor to 
place each preliminary or tentative test item on a small card 
rather than to write these consecutively on ordinary sheets 
of paper. Cards may then be rearranged, shuffled, discarded, 
inserted, etc., without necessitating any rewriting of other 
items. For this purpose 3x5 library cards are best; ruled 
if pen or pencil is used, unruled if the items are typewritten. 
These cards should each contain: 

1. The key letter (to designate the topic) 

2. The test item (double spaced to allow for corrections) 

3. The indicated answer 

4. A temporary sequential number. (It is convenient to 
have this follow the key letter.) 
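The four entries listed above amount to a small record for each item. For a teacher who keeps the cards in a file rather than a card box, the record might be sketched in Python as follows. The class name, field names, and sorting helper are hypothetical and introduced only for illustration; the two sample items and their answers are taken from the preliminary draft for Topic T.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemCard:
    """One preliminary test item, as it would be written on a 3x5 card."""
    key_letter: str  # topic key from the Table of Specifications, e.g. "T"
    number: int      # temporary sequential number within the topic
    text: str        # the item as drafted
    answer: str      # the indicated answer

    @property
    def label(self) -> str:
        """Key letter followed by the sequential number, e.g. 'T-1'."""
        return f"{self.key_letter}-{self.number}"


def sort_cards(cards):
    """Arrange cards by topic key, then by sequential number."""
    return sorted(cards, key=lambda card: (card.key_letter, card.number))


cards = [
    ItemCard("T", 11, "Rome fell in ______ A. D.", "476"),
    ItemCard(
        "T", 1,
        "The discovery of America was part of a mighty historic movement "
        "which transferred naval and commercial power from the ______ Sea "
        "to the ______ Ocean.",
        "Mediterranean; Atlantic",
    ),
]

for card in sort_cards(cards):
    print(card.label, "->", card.answer)
```

Because each record is independent, items can be discarded, inserted, or re-sorted without rewriting any of the others, which is the advantage claimed for the cards.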

The samples below and on the next page are satisfactory: 

T-1 

The discovery of America was part of a mighty historic 
movement which transferred the naval and commercial power 
from the ________ Sea to the ________ Ocean. 


N-26 

John Cabot landed on the shores of 
Florida   Virginia   West Indies   Labrador 


E-11 

After 1534, the Established Church of England 
embraced the Catholic religion. 

True   False 


The preliminary phrasing of the test items should be done 
with reasonable care, although the main attention should be 
given to deciding the type of test technique which would be 
most satisfactory. The refining of phraseology can be done 
more economically at a later time when the final numbers of 
items needed have been decided upon. 

Pages 156-159 give a series of preliminary test items covering 
the first two main topics (T and N) of our Table of Specifications. 
They are written in sequence, but it should be 
remembered that each would appear on a 3x5 card according 
to the recommendations of this volume. 

One important rule might be laid down at this time: In 
framing preliminary test items, try to make up from 25 to 50 
per cent more items than your estimate indicates will be kept 
in the final test. This has two advantages: 

1. A great deal of “culling out” is then possible. 

2. An excess of items gives greater latitude in balancing 
the emphasis on the major topics and in making equivalent 
duplicate forms, if desired. 
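The rule of drafting 25 to 50 per cent more items than will be kept is easily reduced to arithmetic. The Python sketch below gives the range of preliminary items for a planned final length; the function name and the choice to round upward are assumptions made for illustration.

```python
import math


def items_to_draft(final_items, low=0.25, high=0.50):
    """Return the (minimum, maximum) number of preliminary items to write.

    low and high are the surplus fractions recommended in the text;
    fractional results are rounded upward (an assumed convention).
    """
    minimum = math.ceil(final_items * (1 + low))
    maximum = math.ceil(final_items * (1 + high))
    return minimum, maximum


print(items_to_draft(100))  # one form of 100 items
print(items_to_draft(200))  # two forms of 100 items each
```

For a final pool of 200 items the rule calls for between 250 and 300 preliminary items.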
There is one very important aspect of the test at this 
stage of construction, viz., the choice of the types or test 
techniques to be employed. Teachers commonly ask, “What 
is the best type to use, the true-false, the multiple-response, 
the completion, or some other?” This question is very 
difficult of answer. There are important differences between 
the several principal test types; these were listed in a pre- 
ceding chapter. These differences include matters of relia- 
bility, susceptibility to chance and guessing, adaptability to 
the subject-matter at hand, economy and objectivity of 
scoring, etc. Some of the more moot questions receive a 
rather thorough discussion in Part III (especially Chapters 
XI and XII) of this volume. For the present we can do 
little more than to make a few general comments: 

1. The type of test item (technique or mechanical form) 
should be decided principally upon the adaptability of that 
technique to the particular bit of subject-matter. It will 
be noted that the items which are given later in illustration 
of the building of an objective test in history are written 
down in mixed form: some true-false, some simple recall, 
some completion, some multiple-choice, and an occasional 
matching exercise. The decision as to which type to use in 
a given case is largely a matter of judgment and experience. 
Certain bits of subject-matter seem to fit themselves into 
one of the types; others do not lend themselves very readily 
to any of the types. In most cases, however, a combination 
of two or three of the common test types will handle every- 
thing which it is essential to include in the test. 

2. It is ordinarily unwise to leave the items in the scrambled 
arrangement of the sixty-odd preliminary items shown 
on pages 156-159. It is preferable that all true-false items 
be assembled as one part or division of the test; the same 
being done for each of the other kinds of items employed. 
Thus, in a test employing true-false, completion, and matching 
types, it would be best to divide the total test into three 
divisions or parts, one for each type of item, rather than to 
scramble all three techniques into one undivided test. 
Three sets of directions are needed in either case, and it is 
more systematic and less confusing to segregate the different 
types of items, each type having its individual directions. 
Such segregation need not and, for reasons of economy, 
should not be done at this stage of test construction. 
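When the time for segregation does arrive, the sorting is purely mechanical. The Python sketch below groups a few of the drafted items into parts by type. The item labels and their types follow the preliminary draft in this chapter, but the list layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict

# (label, type) pairs for a few of the drafted items, in scrambled order.
drafted = [
    ("T-2", "true-false"),
    ("T-4", "completion"),
    ("N-1", "multiple-choice"),
    ("T-3", "true-false"),
    ("N-3", "completion"),
]


def segregate(items):
    """Group item labels by type, keeping the drafted order within each part."""
    parts = defaultdict(list)
    for label, item_type in items:
        parts[item_type].append(label)
    return dict(parts)


for item_type, labels in segregate(drafted).items():
    print(f"{item_type}: {', '.join(labels)}")
```

Each resulting part would then receive its own set of directions.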

For further guidance in handling the construction of the 
test at this stage, the reader is referred to the reviews of 
experimental studies given in Chapters XI and XII, Part III. 

The preliminary draft of potential test items follows. 


T. Early Trade Routes and Commerce 

T- 1. The discovery of America was part of a mighty historic movement 
      which transferred naval and commercial power from the 
      (Mediterranean) Sea to the (Atlantic) Ocean. 

T- 2. The Mediterranean Sea may be regarded as the “cradle” in which 
      the great nations of the ancient world were “born.”   True   False 

T- 3. The last of the great ancient nations was Greece.   True   False 

T- 4. During the Middle Ages (Italy) was the country which 
      was the chief center of trade. 

T- 5. The interest in commercial things prevented Italy from developing 
      much art.   True   False 

T- 6. Italy connected the markets of China and India with those of Paris 
      and London.   True   False 

T- 7. During the Middle Ages the language of educated persons was 
      (Latin). 

T- 8. The power of ancient Rome and Greece gradually shifted to Spain, 
      Portugal, France, and England.   True   False 

T- 9. The earliest Spanish and Portuguese navigators were chiefly 
      interested in finding lands across the Atlantic which could be 
      colonized.   True   False 

T-10. The early navigators and merchants were seeking new (trade routes) 
      to (China) and (India). 

T-11. Rome fell in (476) A. D. 

T-12. The Crusades were pilgrimages to the Holy Land.   True   False 

T-13. The Crusades greatly stimulated trading with the countries of Asia. 
      True   False 

T-14. Marco Polo was a famous monk.   True   False 

T-15. Marco Polo lived for many years in (China). When he returned 
      to Europe, he told many tales about the vast (riches) 
      of the Orient. 

T-16. Marco Polo was one of the leaders of the Crusades.   True   False 

T-17. Little was known about the Far East before the birth of Columbus. 
      True   False 

T-18. The Italian geographers of 1450 A. D. thought that Asia could be 
      reached by sailing around the southern point of (Africa). 

T-19. The invention of the (compass) was a great aid to navigation. 

T-20. Water routes to the Orient were found earlier than were good land 
      routes.   True   False 

T-21. The land routes to India and China were unsatisfactory due to the 
      slowness of travel, the danger from attack by (robbers), and 
      the high tributes demanded by the (rulers) of the lands 
      through which the merchants passed. 

T-22. When Rome “fell,” the Roman Empire was invaded by (barbarous) 
      tribes from northern (Europe). 

T-23. Spain was at one time conquered by the Moors of Northern Africa. 
      True   False 

T-24. The kings of the former provinces of the Roman Empire, although 
      often tyrannical, were less oppressive than the feudal lords. 
      True   False 

T-25. The kings of the various European countries did little to encourage 
      the development of navigation and commerce.   True   False 

T-26. At the time of the discovery of America, Italy was a nation in name 
      only, being in reality a collection of small city and state 
      governments.   True   False 

T-27. It is believed that a Norseman by the name of Eric the Red first 
      discovered America about 1000 A. D.   True   False 

T-28. Prince Henry of Portugal was a famous (navigator). 

T-29. The explorations of the Atlantic Ocean begun by the Italians were 
      carried on by the (Portuguese). 

T-30. The southern point of Africa is called the Cape of (Good Hope). 
      The first to sail around this cape was a Portuguese seaman. 

T-31. Arrange the following events in order of their occurrence. Mark the 
      first one 1, and the next 2, etc. 

      ( ) Landing of the Pilgrims 
      ( ) Plundering of Rome by the barbarians 
      ( ) Discovery of America by Columbus 
      ( ) Voyage of Eric, the Red 
      ( ) The Crusades 

N. Famous Navigators and Explorers 

N- 1. Columbus was born in   Madrid   Naples   Genoa   Florence 

N- 2. Japan was also known by the name of   Palos   Zipango 
      San Salvador   The Azores 

N- 3. A famous poem celebrating the voyage of Columbus was written by 
      the American poet, (Joaquin Miller). 

N- 4. The first land sighted by Columbus was one of the (Bahama) 
      Islands. 

N- 5. Columbus sailed under the flag of Portugal.   True   False 

N- 6. The long-sought water route to India was first found in 1497 by 
      da Gama   Columbus   Diaz   Balboa 

N- 7. The name “America” comes from the name of an Italian sea captain, 
      (Amerigo) (Vespucci). 

N- 8. Columbus died in ignorance of the fact that he had discovered a 
      new world.   True   False 

N- 9. Pinzon was the first white man to see the Pacific Ocean. 
      True   False 

N-10. (Magellan) was the first to reach the Pacific Ocean directly by 
      sailing across the Atlantic. 

N-11. Magellan was killed in a fight with the natives of the Philippine 
      Islands.   True   False 

N-12. The Strait at the southern end of South America was named for 
      Columbus   Balboa   da Gama   Magellan 

N-13. A Spaniard by the name of (Cortes) discovered the country 
      of Mexico. 

N-14. The first country discovered by the Spaniards which really had the 
      much sought-for riches was (Mexico). 

N-15. Cortes’ treatment of Mexico may be described as   kindly   advisory 
      robbery   co-operative 

N-16. The early missionaries found that the Mexicans were adherents to 
      the Catholic religion.   True   False 

N-17. The ruler of Mexico at the time of the visit of Cortes was 
      (Montezuma). 

N-18. Cortes found a very low form of civilization when he visited Mexico. 
      True   False 

N-19. The conquest of Peru was led by   Cortes   Pizarro   De Soto 
      Ponce de Leon 

N-20. Pizarro’s treatment was very much like that of Cortes for Mexico. 
      True   False 

N-21. Ponce de Leon and De Soto were less fortunate in finding riches in 
      Florida than were Pizarro and Cortes in South America. 
      True   False 

N-22. De Soto finally reached the (Mississippi) River where he 
      (died). 

N-23. The Southwest was explored first by   De Soto   Coronado   Cortes 
      Verrazano 

N-24. The exploration of the St. Lawrence River and surrounding territory 
      was first undertaken by (Cartier) and (Champlain). 

N-25. The last of the great nations of Europe to join the exploration of the 
      New World was   England   France   Spain   Portugal 

N-26. John Cabot landed on the shores of   Florida   Virginia 
      West Indies   Labrador 

N-27. Sir Francis Drake may be described as a   coward   pirate 
      statesman   colonizer 

N-28. The Spanish fleet was known as the (Armada). 

N-29. The breakdown of Spain’s rule of the sea was accomplished by a 
      great (naval) defeat by the ships of (England). 

N-30. Number the following events 1, 2, 3, etc., in the order in which they 
      occurred. 

      ( ) Conquest of Spain and Peru 
      ( ) Defeat of the Spanish Armada 
      ( ) Columbus’ first voyage to America 
      ( ) Marco Polo’s travels 
      ( ) Invention of the mariner’s compass 

III. DECIDING UPON THE LENGTH OF THE TEST 

Judging roughly from the number of items yielded by the 
first two chapters of Beard and Bagley (about thirty pages), 
it appeared that it was quite feasible to make at least 250 
preliminary items on the period prior to the Revolutionary 
War. Allowing a shrinkage of fifty items (more or less), 
it is estimated that 200 suitable items might be secured, if 
needed. These 200 would exhaust the subject rather 
thoroughly. It would thus be possible to make a test of 
200 items, or two forms of the same test with 100 items each. 

The making of two forms is to be preferred over a single
longer form for these reasons: 

1. If it is thought advisable to give 200 items, both forms 
can be administered.

2. It is almost certain that a few pupils will be absent
when the test is given. To use a second form as a “make-up”
(if one is available) will be fairer to all, provided the
two forms are almost exactly equal in difficulty.

3. The second form can be used for re-tests on pupils who 
wish to “make up” their low grades on the regular test. 

4. The two forms may be used in rotation, year after year, 
and thus provide a basis for comparing successive classes 
without serious danger of coaching or cramming effects. 

The discussion will assume from now on that two forms of 
the test will be made, each form having 100 items. It should
be noted that this decision as to the length of the test (in 
terms of numbers of items) was made after information was 
at hand as to such facts as (a) approximate number of worthwhile
items which could be made, and (b) the numbers
needed to cover the subject thoroughly. 

IV. EDITING AND SELECTING THE FINAL ITEMS 

This is the “culling out” stage of the test. It is performed, 
preferably, a day or two after the preliminary drafts of the 
items were made, in order to edit and revise these rough 
statements with a fresh and critical mind. This editorial 
stage is by far the most critical step in the construction of 
the test, unless we except the second operation (drawing up 
the preliminary items and selecting the test type to use).

The teacher must scrutinize each test item much as an 
editor criticizes every line of an important manuscript. The 
test-maker should put himself, so far as possible, in the 
attitude of the pupil. Try to misread the meaning to see if
there are possible misinterpretations that would mislead or
prove ambiguous. Keep in mind that good sentence structure
is a prime requisite for a valid test item. See if an easier
synonym can be found for any difficult word or term. See
that the punctuation is such as to assist in making the intent
of the test item clear. In types of tests like the completion
or simple-recall, write down every answer that you can
think of as likely to be given by a pupil. List those for
which you will give full credit, likewise those for which no
credit will be given. (Avoid giving half-credits.)

There is little that can be said in words which will guide 
the test-maker at this stage. Experience is the only safe
guide in the long run. 

It may help to clarify procedures to study in some detail 
a few of the preliminary drafts of items as given on pages 
156-159 for the projected history test. 

Item T-1. This item seems to be unambiguous. It 
probably should be retained as it summarizes what might be 
thought of as one of the great world movements of all history. 
It is more than a fact question; it expresses a broad concept 
of the sweep of historic events. 

Item T-2. This is similar to the preceding item. It calls 
for thought, as neither the wording nor the fact is stated in 
even similar form in the text. Keep this one. 

Item T-3. This is more nearly a fact question. It is, 
however, an important question likely to catch the pupil
who is “cloudy” on ancient history. A child might know
that Greece and Rome were both great nations but be unaware
that Rome was the conqueror of Greece and the successor
of the latter as the ruler of the world.

Item T-4. This seems to be a valid item. It is largely a
matter of fact, but it bears importantly upon the movements 
which led to the discovery of America. 

Item T-5. Less important, perhaps. Might go out if an 
excess of items is found to be the case. It is phrased to catch
the not-too-alert child. 

Item T-6. Should be discarded or revised. The word
“connected” is too abstract a phraseology for elementary
school pupils. Dull pupils might think of the matter as a
physical connection. It might be improved as follows:
“Italian shipping connected . . . etc.”

Item T-7. This is fact, but surely an important one.

Item T-8. Another item similar to T-1 and T-2, but of 
considerable importance in the broad outlines of history. 

Item T-9. Pupils often confuse the earlier period of the 
search for riches with the later colonization movement. 
This item will help to show up such a confusion. 

Item T-10. A good item except that the first blank (trade 
routes) will elicit numerous doubtful responses. The idea
of trade routes seems to be a unitary idea, and the acceptance 
or rejection of responses other than the one marked as ac- 
ceptable will not be difficult in most cases. 

Note that in such items, either order for “China” and 
“India” is acceptable. This is a general rule covering all 
such cases. “America” should be marked wrong. “Japan” 
is acceptable. 

Item T-11. This is not important enough to demand the 
exact year (on the part of grammar-school pupils). Discard,
or change to “Rome fell between 400 A. D. and 500 A. D.,”
or some less exact statement. 

Item T-12. This should be valid. 

Item T-14. Probably should not be used. It is far- 
fetched. It looks like a premeditated attempt to “fool” the 
pupil. It would be less objectionable in the form: “Marco 
Polo was a famous monarch monk traveler general.” 

Item T-15. The first blank is satisfactory. The second
blank may elicit some responses which will be hard to score.
Keep or discard according to final needs. 

Item T-19. Certain other nautical instruments may oc- 
casionally be mentioned, but these should receive full credit. 

Item T-21. Not of the greatest importance, but a factor 
in stimulating the search for less hazardous routes. “Pirates”
will be given instead of “robbers,” but such a response is 
worthy of credit. Likewise “princes,” “kings,” etc., will be 
given on the second blank. These should receive full credit. 

Item T-23. Unimportant for present purposes. Discard.

Item T-24. Not important for purposes of understanding
the backgrounds of American history; better reserved for
the course in ancient history in the high school.

Item T-26. Similar to T-23 and T-24. 

Item T-28. Unimportant. A better form, perhaps, 
would be: “Prince Henry of Portugal was known as the
‘Navigator.’ ” True False

This change, however, does not increase its importance. 

Item T-31. Such exercises are valuable if the chronologi- 
cal decisions are not too close. The present items are not 
objectionably close together in historical sequence. If two 
forms of a test are to be made, several more such matching 
tests should be prepared so that there are at least two or 
three such exercises in each form of the test. Otherwise it 
will be wasteful to write directions for and to administer a 
single five-item matching test. 

The foregoing comments may or may not help the reader.
They do typify, to some extent at least, the frame of mind
which the test-maker must assume in criticizing his work.

Since it was decided to make two forms of this test with 
100 items per form, it is necessary to eliminate eleven items 
if we follow the percentage allowance of the Table of Specifications.
(I. e., 10% of 200 = 20.) The author will not
attempt to do this. It is sufficient to note that there is
somewhat more than a fifty per cent excess of items falling
under this main topic. This should guarantee reasonably
well that the twenty finally selected are important and valid. 

V. RATING THE ITEMS FOR DIFFICULTY

The advantage of having the items of a test in increasing
order of difficulty was discussed in Chapter II. The procedure
for such ratings is simple, but difficult in the sense
that such ratings are not very accurate since they are at
best highly subjective estimates.

The rating may be done on either a 5- or 10-point scale,
the former probably being fully as accurate as the latter. 
Whichever scale is used, the procedure is as follows: 

1. Rate “1” those items which are so easy that all or 
nearly all of the pupils may be expected to answer them 
correctly. 

2. Rate “5” (or “10” depending upon the scale used) 
those items which you think will be failed by all or nearly 
all of the pupils. 

3. Assign the intermediate ratings (“2” to “4,” or “2” 
to “9”) to those intermediate in difficulty, so far as you are 
able to distribute the ratings by approximately equal 
intervals. 

4. Write the ratings on the item cards. (Note: if two or 
more teachers co-operate in rating the items, be sure to use 
the same scale. In this case, place the ratings on the back 
of the card, each teacher being assigned a corner for her
rating and being cautioned to make her rating before turning 
over the card and thus exposing to view any previously 
recorded ratings.) 

It is assumed that all eliminations to the desired numbers 
have been made before the ratings are carried out. The 
rating may be done first, however; this procedure has the 
advantage of making possible the elimination at once of any 
excess number of items rated as too easy or too difficult. 
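The pooling of several teachers' independent ratings, and the early removal of items rated too easy or too difficult, may be sketched as follows. This is only an illustration: the ratings shown are invented, and the cut-off points (a pooled rating of 1.5 or below, or within half a point of the top of the scale) are assumptions made for the example, not rules laid down in the text.

```python
from statistics import mean

def pool_ratings(ratings_by_item):
    """Average the independent ratings given to each item (one shared scale)."""
    return {item: mean(ratings) for item, ratings in ratings_by_item.items()}

def flag_extremes(pooled, scale_max=5):
    """Pick out items whose pooled rating lies at either end of the scale.

    The cut-offs are hypothetical; a teacher would choose her own.
    """
    too_easy = [item for item, r in pooled.items() if r <= 1.5]
    too_hard = [item for item, r in pooled.items() if r >= scale_max - 0.5]
    return too_easy, too_hard

# Hypothetical ratings by three teachers on the 5-point scale.
ratings = {
    "T-1": [2, 3, 2],
    "T-3": [1, 1, 1],
    "T-7": [3, 4, 3],
    "T-11": [5, 5, 4],
}

pooled = pool_ratings(ratings)
too_easy, too_hard = flag_extremes(pooled)

# Items in increasing order of pooled difficulty.
ordered = sorted(pooled, key=pooled.get)
```

The same pooled ratings serve later for rearranging the retained items in increasing order of difficulty.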

VI. BREAKING THE ITEMS INTO EQUIVALENT FORMS

The history test (used here as an illustration) may be 
constructed in two roughly equivalent forms as follows:

1. Throw out all doubtful, too-easy, too-difficult, or
otherwise unsatisfactory items until the numbers shown by
the Table of Specifications are approximated. (This is
really Step IV in our outline, and if these eliminations have 
already been made, there is nothing further to be done about 
selecting the items.)

2. Deal the items into two (or more, as the choice may be)
piles exactly as playing cards would be dealt. The intention
here is to equalize the forms through the law of chance.
The final items are still in topical arrangement before the
process of breaking into duplicate forms is begun. Therefore,
if the items of the first topic are thus subdivided, then
the next topic, etc., each form will receive approximately
equal numbers (equal samplings) of each major topic (such
as were indicated by the key letters T, N, E, etc., in our 
illustration). 

Another procedure which accomplishes the same purpose 
would be as follows: 

1. After the eliminations to the final numbers have been 
made, renumber the items consecutively, beginning with the 
first item of the first main topic and proceeding through the
other main topics in order. 

2. Throw the odd-numbered items (Nos. 1, 3, 5, 7, etc.) 
into one form and the even-numbered items (Nos. 2, 4, 6, 8, 
etc.) into the second form. If three forms are to be made, 
the procedure is obviously slightly different, thus: 

Form A      Form B      Form C

Item 1      Item 2      Item 3
Item 4      Item 5      Item 6
Etc.¹
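The dealing of topically arranged items into two or three forms can be sketched in a few lines. The function name and the twelve numbered items are illustrative only; the rule itself (item 1 to the first form, item 2 to the second, and so on in rotation) is the one just described.

```python
def deal_into_forms(items, n_forms=2):
    """Deal items, in the order given, into n_forms piles like playing cards."""
    forms = [[] for _ in range(n_forms)]
    for position, item in enumerate(items):
        forms[position % n_forms].append(item)
    return forms

# Twelve items, renumbered consecutively in topical order.
items = list(range(1, 13))

form_a, form_b = deal_into_forms(items, 2)   # odd-numbered and even-numbered items
three_forms = deal_into_forms(items, 3)      # Forms A, B, and C
```

Because neighboring items belong to the same topic, each form receives about the same share of every topic.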

¹ It should be noted carefully that we are considering the case where items are arranged
topically and not in order of increasing difficulty. The breaking into equivalent forms is a
different matter in the latter case. If, as is often the case in standard test construction,
the items are arranged in order of difficulty before they are broken into forms, the procedure
would be:

For Making 2 Forms        For Making 3 Forms

Form A    Form B          Form A     Form B     Form C

Item 1    Item 2          Item 1     Item 2     Item 3
Item 4    Item 3          Item 6     Item 5     Item 4
Item 5    Item 6          Item 7     Item 8     Item 9
Item 8    Item 7          Item 12    Item 11    Item 10
Etc.                      Etc.


This plan should be studied carefully, as it is designed to prevent systematic differences 
in difficulty of the different forms. 

The differences between the dealing of (1) items in chance order of difficulty and (2)
those in increasing order of difficulty are exactly analogous to (a) the usual dealing of shuffled
playing cards and (b) dealing in turn of a deck of cards which have been first arranged in
order of increasing value. Under (b) player No. 2 would always receive a slightly better
card than player No. 1, and his resulting hand would be systematically better than that of
No. 1.
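The reversing plan of the footnote may be sketched as below, on the assumption that the items are already numbered in increasing order of difficulty. The function name is invented for the illustration; the assignments it produces are those of the footnote's two tables.

```python
def deal_serpentine(items_by_difficulty, n_forms=2):
    """Deal items ranked by difficulty, reversing direction on every round.

    Round 1 runs Form A, B, C; round 2 runs C, B, A; and so on, so that
    no form always receives the harder item of each round.
    """
    forms = [[] for _ in range(n_forms)]
    for start in range(0, len(items_by_difficulty), n_forms):
        round_items = items_by_difficulty[start:start + n_forms]
        round_no = start // n_forms
        if round_no % 2 == 0:
            order = range(n_forms)                  # A, B, C, ...
        else:
            order = range(n_forms - 1, -1, -1)      # ..., C, B, A
        for form_index, item in zip(order, round_items):
            forms[form_index].append(item)
    return forms

two_forms = deal_serpentine(list(range(1, 9)), 2)
three_forms_ranked = deal_serpentine(list(range(1, 13)), 3)

# Sum of difficulty ranks in each form: equal under this plan.
rank_sums_two = [sum(form) for form in two_forms]
rank_sums_three = [sum(form) for form in three_forms_ranked]
```

With eight ranked items, plain dealing in turn would give rank sums of 16 and 20; the reversing plan gives 18 and 18.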

It should be noted that assignment by chance to duplicate
forms can only be depended upon when fairly large numbers of
items are involved. The breaking of 100 items, by chance,
into two forms of fifty items each may be expected to yield
forms that differ not more than from two to five points in
average difficulty; occasionally the difference will be more,
but more often less. When 200 items are broken into two
forms of 100 each, the expected variation in difficulty in the
resulting forms would be relatively less. In any case, if
from 100 to 200 items are broken into two forms by chance,
the resulting inequality of forms will be markedly less than 
if successive examinations are constructed de novo each year. 
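The statement that larger pools give more nearly equal forms can be checked by a small experiment. The sketch below rests on assumptions not found in the text: each item is given an imaginary difficulty value drawn evenly between 0 and 100, the pool is split at random many times, and the typical gap between the two forms' average difficulty is recorded. The figures it yields depend wholly on that invented scale and should not be set beside the "two to five points" quoted above; only the comparison between pool sizes is of interest.

```python
import random

def average(values):
    return sum(values) / len(values)

def chance_split_gap(difficulties, rng):
    """Split the pool at random into two halves; return the gap in mean difficulty."""
    shuffled = list(difficulties)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return average(shuffled[:half]) - average(shuffled[half:])

def spread_of_gaps(n_items, trials=2000, seed=0):
    """Root-mean-square gap between two chance forms over many random splits."""
    rng = random.Random(seed)
    # Hypothetical difficulty values; the uniform 0-100 scale is an assumption.
    pool = [rng.uniform(0, 100) for _ in range(n_items)]
    gaps = [chance_split_gap(pool, rng) for _ in range(trials)]
    return (sum(g * g for g in gaps) / trials) ** 0.5

spread_100 = spread_of_gaps(100)   # 100 items into two forms of 50
spread_200 = spread_of_gaps(200)   # 200 items into two forms of 100
```

Under these assumptions the typical gap for 200 items comes out smaller than for 100, in keeping with the author's remark.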

VII. REARRANGING THE ITEMS IN ORDER OF DIFFICULTY

If the items of the test (or of the individual forms) have 
already received difficulty ratings, it is a simple matter to 
rearrange them in increasing order of difficulty. It has 
already been pointed out that this step, if taken, increases
the reliability of the test through such means as increased 
motivation, a better distribution of time and effort on the 
part of the pupils, etc. 

VIII. PREPARING THE INSTRUCTIONS FOR THE TEST

The variations in the instructions given for objective 
tests are legion. Most test authors have their preferred 
forms of stating such directions. The important things 
about test instructions are clarity, fullness, and brevity 
consistent with the first two requirements. 

The amount of detail needed will depend largely upon (a)
the familiarity of the pupils with tests of the type being
given, and (b) the ages or mentality of the pupils. In
grades two to four or five it is decidedly better to use the
method of written instructions which are read silently by the
members of the class while the teacher (or other examiner)
reads the instructions aloud. After the pupils have become
“test wise” from repeated contacts with objective or standard
tests, the instructions may be abbreviated except
where new test techniques are employed. 

The following general rules may help to emphasize the 
significant features of a good set of directions: 

1. In writing instructions, phrase the directions so as to 
meet the level of the lowest mentalities in the group. 

2. Use the simplest synonyms for all words or ideas.
With very young children it may occasionally be necessary
to sacrifice grammatical construction in the interest of clarity;
e. g., in a two-choice test it is probably permissible to instruct
pupils to select the “best” answer rather than the
“better,” since “best” is the normal expression of young 
children. Critics occasionally object to calling test items 
(which are usually complete or incomplete declarative 
sentences) “questions.” This criticism is beside the point 
with very young pupils. Moreover, in the traditional 
examination, as well, the word “question” has always 
included both imperative and interrogative sentences. 

3. Be generous in the use of samples, especially with young 
or backward pupils. The examination is a measure of 
achievement, not of ability to understand and to follow 
directions. In the case of certain standard tests many of 
the zero and very low scores arise solely from inadequate
instructions, the inadequacies being chiefly undue brevity
and adult terminology. 

4. Where the test technique is complicated, such as is the 
case with many matching tests, multiple-response tests with
numbered alternatives, some kinds of cross-out tests, etc.,
use a fore-exercise or practice test to supplement the printed
or verbal directions. The National Intelligence Test is a
good example of the use of fore-exercises in a standard test.

5. The instructions should direct the pupil where and how
to record his answers. The samples should also show the 
same facts. The pupils should be told whether to hurry or 
to work slowly and carefully. If the test is timed, the pupils 
should be told in advance what the time allowance is. 

6. It is gradually being conceded that the instructions 
should inform the pupil about the answering of doubtful 
and unknown questions. DeGraff and Ruch have found
some evidence (Chapters XI and XII) that it is more valid 
to instruct pupils not to guess when the answering reduces 
to pure guessing. 

In cases of doubt, but where the pupil has some “hunch”
or inkling as to the probably correct answer, it seems better 
to allow him to follow his “hunches.” The experimental 
evidence on this point is rather meager and authorities differ. 
Dr. Ben D. Wood has always used instructions against pure 
guessing. Dr. W. A. McCall has taken the other point of 
view upon the theory that the more the guessing, the more 
adequate the statistical correction for guessing or chance. 
The work of DeGraff and Ruch tends to support Wood’s 
position rather than that of McCall, although the issue is 
not as important as some seem to think. Teachers have 
objected to encouraging guessing as a matter of bad habit 
formation. 
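The “statistical correction for guessing” is not spelled out in this section. A minimal sketch follows, assuming the conventional rule in which the score is the number right less the number wrong divided by one fewer than the number of choices per item (for true-false, simply rights minus wrongs). The function name and the sample papers are hypothetical.

```python
from fractions import Fraction

def corrected_score(rights, wrongs, n_choices):
    """Conventional correction for chance: R - W / (n - 1).

    Omitted items are neither rewarded nor penalized.
    """
    if n_choices < 2:
        raise ValueError("an item must offer at least two choices")
    return rights - Fraction(wrongs, n_choices - 1)

# Hypothetical true-false paper: 50 items, 30 right, 10 wrong, 10 omitted.
true_false = corrected_score(30, 10, n_choices=2)

# Hypothetical five-choice paper: 40 right, 20 wrong.
five_choice = corrected_score(40, 20, n_choices=5)

# A pupil guessing blindly on 50 true-false items expects 25 right, 25 wrong.
blind_guesser = corrected_score(25, 25, n_choices=2)
```

Under this rule the blind guesser's expected score is zero, which is why McCall's position holds that the more the guessing, the more adequately the correction operates.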

The following sets of test directions are thought to be 
reasonably adequate in the main. 

1. True-False 

Below are 50 true-false statements. About one-half of them are true 
and about one-half are false. Read each statement carefully. If you think 
it is true, draw a line under “true.” If you think it is false, draw a line 
under “false.” 

Take each question in order, but do not waste too much time on one 
that you do not know. Skip it and go on to the next. Do not guess!

If you have any time left you may go back and work on those you 
left out.

Ask no questions after the signal to “Go” is given.

Study the samples carefully before beginning actual work. 

[Samples follow] 

In case the words “true” and “false” do not appear after 
each item, it is customary to place a dotted line either at the
extreme left or right of the item. The pupil is then instruct- 
ed to record his judgments by some such markings as: 

Marking   If true    If false

(a)       T          F
(b)       True       False
(c)       Yes        No
(d)       +          0
(e)       +          −


Of these (a) is not very satisfactory, for two reasons: (1) T 
and F look too much alike for accurate and rapid scoring, 
and (2) when pupils correct their own tests, it is too tempting 
to change the T into an F by adding a short mark. 

Numbers (d) and (e) seem the best both for speed and 
clarity. There is little to choose between the two. High- 
school pupils who have studied algebra may feel that the 
plus and minus method is slightly more meaningful. When 
self-correction is allowed, (d) is greatly to be preferred over
(e), since pupils can change − to + too easily during the
process of correcting the papers.

2. Multiple-choice 

There are five possible words given for completing each incomplete
statement below. Only one of these words makes the statement true.

Read each question carefully, decide which word makes the truest
completion, and then draw a line under that word, as shown in the samples.

If you do not know the answer to any question, do not waste time on it,
but go on to the next. Do not hurry, as there will be enough time for all
to finish.

Look at the three samples before you begin to work. 

[Samples follow]

When the multiple-choice test requires responding by
writing the number of the correct response, the wording 
might be: 

Below are 75 incomplete statements. Five words or phrases are given 
after each statement. One of these five words or phrases makes the state- 
ment true; the other four are incorrect. 

Read each incomplete statement carefully, decide which of the five 
possible words or phrases makes the truest sentence, then write the Number 
(not the word or phrase itself) on the dotted line at the right, as shown in 
the samples. 

Samples: 

(a) The best temperature for living rooms is (1) 50° (2) 60°      ..3..
    (3) 68° (4) 75° (5) 78°

(b) The blood is pumped by the (1) liver (2) lungs (3) stomach    ..5..
    (4) veins (5) heart

Begin here. 


3. Completion 

Certain words have been left out in the sentences or paragraphs given
below. Dotted lines show where the words are left out. In most cases
just one word has been left out.

You are to write, on the dotted lines, the words which have been left
out. The three samples show you how to do this test.

[Samples follow] 

Do the rest of the sentences or paragraphs in exactly the same way as 
shown in the samples. 

The Paragraph Meaning test (a completion test) of the 
Stanford Achievement Test gives a set of directions which 
have proved to be adequate in high-first and low-second 
grades. This test uses the double method of spoken direc- 
tions by the teacher and silent reading by the class. 

Read the words at the top of the page, here. (Hold up booklet and
point to the sample sentence.) It says (read slowly): Dick and Tom were
playing ball in the field. Dick was throwing the ball and ..... (pause)
was trying to catch it. Who was trying to catch the ball? (Encourage
pupils to answer aloud.) As soon as correct answer is given, say: Yes,
Tom was trying to catch it. You must write Tom on the dotted line.
(Pause until word is written.)

Wherever you see a dotted line on these two pages, it means that a
word has been left out. Begin with No. 1, read each sentence carefully, 
and write JUST ONE WORD on each dotted line to show what has been 
left out. When you have finished the first page, go right on to the second 
page. Ready — Go. 

The pupil’s test booklet (which is before the child as the 
teacher gives the above directions) looks like the following: 

Stanf. Adv. Exam. A 

TEST 1. READING: PARAGRAPH MEANING 

Sample: Dick and Tom were playing ball in the field. Dick was throwing
the ball and ..... was trying to catch it.

Write JUST ONE WORD on each dotted line.

1. Fanny has a little red hen. Every day the hen goes to her nest and lays
an egg for Fanny to eat. Then she makes a funny noise to tell Fanny
to come and get the .....

2. A kitten can climb a tree, but a dog cannot. This is very lucky for
Nellie’s kitten. Every time Joe’s big dog comes along the kitten climbs
a tree and the ..... cannot follow.

3. Etc. 

4. Matching Tests 

The column at the left below gives the names of ten men. The column
at the right gives ten events connected with the names of the ten men.
Look at each event in the column at the right; then find the man in the
column at the left who is connected with that event.

Place the Number (not the name) of the man in front of the event with
which he is associated. The first one is already done correctly as a sample.


Men                         Answer    Events

 1. George Washington       ..7..     Inventor of the cotton gin
 2. Harvey W. Wiley         .....     Louisiana Purchase
 3. Thomas Jefferson        .....     Forest conservation
 4. Robert E. Lee           .....     Development of banking system
 5. William J. Bryan        .....     Pure food laws
 6. Alexander Hamilton      .....     President of the Confederacy
 7. Eli Whitney             .....     First President of the U. S.
 8. Robert Fulton           .....     Advocate of free silver
 9. Gifford Pinchot         .....     Inventor of the steamboat
10. Jefferson Davis         .....     Commander of the Southern armies

In case of incomplete matching (where one column contains
an excess of terms in order further to reduce chance
successes), certain changes will have to be made in the above 
instructions. 
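How far an excess of terms reduces chance successes may be shown by direct counting. The sketch assumes a pupil who knows nothing and assigns the names wholly at random, using no name twice; the function name is invented for the illustration, and the counts are kept small so that every possible assignment can be listed.

```python
from fractions import Fraction
from itertools import permutations

def expected_chance_successes(n_events, n_names):
    """Average number of correct matches over every possible blind assignment.

    Event i is taken to belong to name i; names beyond n_events are the
    excess terms that belong to no event.
    """
    total_correct = 0
    assignments = 0
    for assignment in permutations(range(n_names), n_events):
        total_correct += sum(
            1 for event, name in enumerate(assignment) if event == name
        )
        assignments += 1
    return Fraction(total_correct, assignments)

complete = expected_chance_successes(4, 4)     # four events, four names
incomplete = expected_chance_successes(4, 6)   # four events, six names
```

On these assumptions blind matching earns one success on the average when the columns are equal, and two-thirds of a success when two extra names are added; in general the figure is the number of events divided by the number of names.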

These suggested instructions for four of the principal types 
of objective tests must suffice for presenting the fundamental 
principles in phrasing directions to the pupils. The exact 
phraseology is of small importance; the main issues being 
clarity, completeness, and the generous use of samples to 
supplement the written instructions. With young children, 
reading in unison and marking sample items under the
direction of the examiner are most helpful. 

IX. MAKING THE ANSWER KEYS OR STENCILS

The choice of an economical answer key or stencil depends 
principally upon (a) the nature of the test to be scored and 
(b) the number of tests to be scored.

The labor of scoring may be made very small indeed if 
the needs of economical scoring are kept in mind when the 
test is planned. Certain tests like the multiple-response 
may be quite laborious in scoring even when the most con- 
venient scoring devices are provided, but may be planned 
so as to obviate from fifty to ninety per cent of such labor 
by better arrangements of the test items proper. 

We need to recognize two principal types of devices for 
indicating responses, as follows: 

1. Aligned response columns, usually vertical in position.

Snowbound was written by . . . . . . . . . . . . . . . . .   ..........

The date of the birth of Shakespeare was . . . . . . . . .   ..........

2. Staggered response blanks (or positions), e. g.,

The purified blood returning to the heart from the .......... enters
the auricle and from there it passes through the
valves into the .........., etc.

One of the principal products of China is corn gold wheat tea
ironwood

Ohio is bounded on the west by Missouri Indiana Iowa Illinois
Michigan

Aligned responses are always more economical and are 
always possible with simple recall, matching, and true-false 
tests. They are often possible with multiple-choice tests 
(when the method of response is by number, not underlining) . 
The force of these suggestions may be felt by studying the 
following fragments of tests. 


I. 

1. A little .......... is a dangerous thing. — Pope.

2. To me the flower that blows can give
Thoughts that do often lie too deep for tears. — Wordsworth.

3. Honor and shame from no condition rise;
.......... there all the honor lies. — ..........

II. 

1. The formula for sulphuric acid is H2SO4

2. Chlorine belongs to the group known as Halogens 

3. Alkaline solutions turn phenolphthalein 

III. 

1. The mosquito is known to carry typhoid malaria small-pox yellow 
fever 

2. The most important class of foods for tissue-repair are the proteids
minerals fats carbohydrates

3. The trunk is divided into two main cavities by means of the ribs
œsophagus trachea diaphragm

IV. 


1. The American Revolution began in (1) 1762 (2) 1775 
(3) 1783 (4) 1789 (5) 1812 

2. Cornwallis surrendered at (1) Yorktown (2) Jamestown 
(3) Saratoga (4) Appomattox (5) Valley Forge 

3. The President of the Confederacy was (1) Lee
(2) “Stonewall” Jackson (3) Thomas (4) McClellan
(5) Jefferson Davis

V.


1. All bacteria are injurious. True False 

2. The bodies of all plants and animals are made up of cells. True False 

3. The chief use of the red blood corpuscles is to kill disease germs in the 
blood. True False


VI. 


1. Stevenson wrote Treasure Island.                    True    False

2. Dickens was a writer of lyric poetry.               True    False

3. She Stoops to Conquer was written by Byron.         True    False

VII. 



1. A straight line is the shortest distance between two points. 

± 

2. In the expression 37y*, y is an exponent. 



3. If X equals —3. 




These seven examples are all common practices. The 
advantages and limitations of each call for brief comment. 

No. I is not economical of scoring. It is about as good a 
device as can be adopted for such a test, however. It will 
require several times as long per blank as No. II, but it 
probably cannot be improved greatly without grave com- 
plications. 

No. II is usually termed “simple-recall,” and is a form of 
completion test. It is the most rapidly scorable type of
completion test. Where conditions permit, it is to be preferred
over No. I. Note, however, that teachers often place
the terminal blanks immediately after the close of the statement.
This makes the blanks occur in a staggered arrangement
very unsatisfactory for scoring. The recommendation
is to align the blanks in the simple-recall tests, even if the 
statements vary greatly in length. If desired, there can be
hyphen leaders (- - - -) or dots (. . . .) inserted between
the end of the statement and the response lines, thus:

The formula for sulphuric acid is . . . . . . . . . . . .   H2SO4

Chlorine belongs to the group known as . . . . . . . . .   Halogens

In typewritten material the best practice is dots with a
solid line for the response. If the material is to be set in 
type, dot leaders should be used with hyphen leaders for 
the response line. 

No. III is to be compared with No. IV. No. III is simpler,
and perhaps better adapted to children below the sixth grade 
because more simple instructions may be written. No. IV 
is by all means more desirable from the standpoint of econ- 
omy of scoring. The device employed in No. IV was in- 
vented by Dr. Arthur S. Otis, and represents the most 
rapidly scored multiple-choice test technique ever developed. 
A variation is the use of (a), (b), (c), etc., instead of (1),
(2), (3), etc., for labelling the responses. Letters have the 
advantage in tests in mathematics or history information 
where numbers (dates) occur frequently. This plan of 
numbered (or lettered) responses occasionally leads to error
when the pupil selects the right response but writes the 
wrong number (or letter). For this reason it is chiefly 
adapted to high-school and college levels. 

Nos. V, VI, and VII should be compared. No. V is 
markedly inferior to the other two. No. V has no advantage 
except the very slight one (Cf. Nos. III and IV in this
respect) that, in the case of Nos. VI and VII, there is some
danger of the eye failing to “carry” out to the correct blank
or response word. The answer is consequently misplaced
one or more blanks either up or down. Scorers should be on
their guard against this. No. VII is faster than No. VI,
but the difference is not great enough in many circumstances
to make it a vital factor. Note that + and 0 are often used
instead of + and − to indicate true and false, respectively.
Sometimes the method of responding is by writing “T”
and “F” on the blanks. This is probably inferior to plus
and minus. Both the + and − and the T and F (written in)
methods are not well adapted to having pupils correct their
own papers. It is too easy to change a − into a + or a T
into an F when correcting the papers. The use of + and 0
is somewhat better for self-correction, although underlining
is also reasonably satisfactory. Many test workers prefer
to have the +, −, or 0 placed between the item number
(at the left) and the first word of the statement.

A few minutes’ study of the samples of test techniques
given in Chapters VIII and IX will reveal which ones are 
best adapted to rapid, accurate, and easy scoring. 

We can now classify answer keys and stencils as follows: 

1. Strip keys for aligned vertical columns of response 
blanks or response words in such tests as: 

(a) Simple recall

(b) Numbered multiple-choice

(c) Matching

(d) True-false (especially the + −, the + 0, or the writing
of T and F, etc.)

2. Transparent celluloid or tissue-paper stencils for such
tests as:

(a) Unnumbered staggered multiple-response

(b) True-false, yes-no, same-opposite, etc., when underlined

3. Cut-out stencils for such tests as:

(a) Staggered (ordinary) completion

(b) Staggered computation

4. Answer sheets for reference (not to be superimposed or 
aligned directly on the test sheets). These are sometimes 
used for almost any variety of tests. 

1. The Strip Stencil. Fig. 8 shows a strip stencil applied
to a page of test material of the numbered multiple-response
type. The stencil is merely a strip of heavy paper or cardboard
from one-half to one inch wide and the length of the
test page. The easiest way of making such a strip stencil
is to write down the correct answers on an extra test-sheet.
Then place the strip immediately to the left of the column
of answers and write each answer on the strip. The illustration
shows the stencil applied to an actual paper, the marks
in the right-hand margin representing errors or omissions.

Fig. 8. A strip stencil for scoring an English test.

The scoring of a paper with such a stencil reduces to a 
process of comparing paired numbers, letters, etc. The 
ordinary objective examination, if open to such scoring, can 
be scored in from one to two minutes, depending upon its 
length, the number of different pages, and other factors. 
The author once had occasion to score several thousand 
fifteen-page booklets, of which Fig. 8 shows the first page. 
There were 400 items in the test, all of the type shown. The
better scorers exceeded the rate of two items per second,
making it possible to check 400 items on fifteen pages,
count the number of correct answers, and record the score 
on the title page in less than five minutes per booklet after a 
little experience. This rate is many times more rapid than 
that of the reading of any essay examination of comparable 
scope. 
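
The paired comparison that a strip key makes possible can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `score_with_strip_key`, the five-item key, and the sample paper are hypothetical and are not taken from the test shown in Fig. 8.

```python
from typing import Dict, Optional, Sequence


def score_with_strip_key(
    key: Sequence[str],
    responses: Sequence[Optional[str]],
) -> Dict[str, int]:
    """Compare each response with the key entry aligned beside it.

    A response of None or an empty string counts as an omission.
    """
    if len(key) != len(responses):
        raise ValueError("key and response column must be the same length")

    rights = wrongs = omissions = 0
    for correct, given in zip(key, responses):
        if given is None or given.strip() == "":
            omissions += 1          # blank left unanswered
        elif given.strip() == correct:
            rights += 1             # paired entries agree
        else:
            wrongs += 1             # paired entries differ
    return {"rights": rights, "wrongs": wrongs, "omissions": omissions}


# Hypothetical key and pupil paper (numbers of the chosen responses).
key = ["3", "1", "3", "1", "2"]
paper = ["3", "1", "4", None, "2"]
tally = score_with_strip_key(key, paper)
print(tally)
```

The scorer never reads the test item itself; only the response column and the key are compared, which is what makes the procedure rapid.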

Fig. 9 shows three other strip stencils. Stencil A was 
planned for use with a non-staggered simple-recall test. 
Stencil B is used with a plus-zero response, true-false test. 
Stencil C is used with a true-false test where T and F are 
encircled or underlined. The adaptability of the strip stencil 
to other aligned forms of tests needs no further comment. 

2. The transparent stencil. Transparent stencils may be
made either of tissue paper or of celluloid sheets such as 
were formerly much used in side curtains of automobiles. 
The celluloid is much to be preferred, although it is some- 
what expensive. Where large numbers of tests of the stag- 
gered type are to be scored, the celluloid stencil is the only 
very feasible device. The million or two of intelligence 
tests given in the United States Army during the World War 
were scored by means of such stencils. 

Fig. 9. Various types of strip stencils (Stencil A, Stencil B, and Stencil C).

In making a transparent stencil the first step is to mark
an extra set of the test papers or booklets so that the correct
answers are underlined (or written in, etc., as the case may
be). Then place the celluloid or tissue-paper sheet directly
over the test page thus marked. Assuming that the responses
are indicated by underlining, place dots on the
transparent stencil in such a position that these dots fall
directly on the middle point of each underscoring on the
test page below. Care must be taken to keep the stencil in
exact place while making the dots. Launderer’s ink is best
for such stencils. After the dots are dry, dip the stencil in
white shellac and hang it up by one corner for a few minutes
to dry. Such a stencil is almost indestructible and may be
used thousands and thousands of times.

Fig. 10 shows how such a transparent stencil will look if 
superimposed upon an actual test page; the printed or 
mimeographed test material is not shown, although of 
course it does show through such a stencil. Wherever a dot 
(on stencil) appears to bisect the line (on pupil's paper), the 
answer is correct. If a dot and line fail to superimpose, the 
answer is wrong. If a dot appears but no line, the item was
omitted. Note how quickly the four errors (three errors
and one omission) on this one page can be detected. With a
little practice, the eye will cover such a stencil-test assembly
very rapidly, and with slight danger of serious error. The
practiced scorer using transparent stencils soon becomes
almost an automaton, the failure of dot and line to superimpose
actually seeming to “strike him in the face” as the
eye travels down the page.

Such stencils may be made of tissue paper or waxed paper, 
a good quality of waxed paper having the advantage that 
it is fairly transparent while still having considerable strength
and resistance to folding and tearing. Such paper stencils 
have short life and are less convenient to handle than the 
more firm and durable celluloid sheets. 
Teachers who follow our suggestions for duplicate examinations
to be administered in alternate years can well afford to
make and keep on file scoring stencils. 

3. Cut-out stencils. These will serve much the same 
purposes as the transparent stencil. They are somewhat 
more trouble to prepare, but are less expensive than the 
celluloid stencils. As before, the first step is to write in the
correct answers on an actual test-sheet or series of sheets, 
particularly in the case of completion tests, computations, 
etc. Then take a sheet of thin cardboard and a piece of 
carbon paper. Place the marked test-sheet upon the card- 
board (the same size as the test sheet) with the carbon sheet 
in between. Draw rectangles around each answer, making 
the rectangles large enough to include the space ordinarily 
required by a pupil’s answer. Remove the cardboard sheet 
and cut out the rectangles as indicated by the lines drawn. 
Below the opening of each rectangle, write the answer which 
would appear in the opening above the rectangle in case 
the pupil answered the item correctly. Fig. 11 shows how 
such a cut-out stencil would look when superimposed upon 
an actual test page. 

Note that two errors appear on this test page. These 
have been checked in the right-hand margin. 

Such stencils may be used for a wide variety of tests, in 
general, for any staggered arrangement of test responses. 
Arithmetic or algebraic computation tests lend themselves
to such scoring. 

4. Answer sheets for reference. Under average condi- 
tions the teacher who has to score but twenty to forty 
papers at a time will often feel that she need not prepare an
elaborate stencil like those already described. For short 
tests the answers can be memorized quickly as the result 
of scoring a dozen or so tests. In such cases a list of answers 
for reference will suffice. In the case of completion tests 
it is almost necessary to keep a list of answers which have
actually been found among the pupils' replies. Two such
lists should be jotted down: (a) those answers for which
credit is granted, and (b) those answers deemed unworthy of
merit. Such lists must be growing lists, since each additional
paper is likely to raise new issues as to scoring.
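
The two growing lists can be pictured as a small record kept for each blank. The following Python sketch is offered only as an illustration of the bookkeeping; the class name `CompletionAnswerLists`, its methods, and the sample answers are hypothetical.

```python
from typing import Dict, Optional, Set


class CompletionAnswerLists:
    """Two growing lists per blank: answers credited and answers rejected."""

    def __init__(self) -> None:
        self.credited: Dict[int, Set[str]] = {}
        self.rejected: Dict[int, Set[str]] = {}

    @staticmethod
    def _normalize(answer: str) -> str:
        # Ignore capitalization and stray spacing when comparing answers.
        return " ".join(answer.lower().split())

    def rule(self, blank: int, answer: str, credit: bool) -> None:
        """Record the scorer's ruling on an answer found in a pupil's paper."""
        norm = self._normalize(answer)
        keep, drop = (
            (self.credited, self.rejected) if credit
            else (self.rejected, self.credited)
        )
        keep.setdefault(blank, set()).add(norm)
        drop.get(blank, set()).discard(norm)  # a later ruling replaces an earlier one

    def lookup(self, blank: int, answer: str) -> Optional[bool]:
        """Return True (credit), False (no credit), or None (not yet ruled on)."""
        norm = self._normalize(answer)
        if norm in self.credited.get(blank, set()):
            return True
        if norm in self.rejected.get(blank, set()):
            return False
        return None


# Hypothetical rulings for blank 1 of a completion item.
lists = CompletionAnswerLists()
lists.rule(1, "Cells", credit=True)
lists.rule(1, "atoms", credit=False)
print(lists.lookup(1, " cells "), lists.lookup(1, "Atoms"), lists.lookup(1, "tissues"))
```

An answer that returns `None` is a new issue: the scorer decides it once, records the ruling, and every later paper is treated alike.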

It is ordinarily unwise to use any plan of scoring which
calls for actual reading of the test items and the pupils’ re- 
sponses. In objective tests it is sufficient to scrutinize only 
the response; any scoring plan which requires attention to 
the printed or mimeographed test item proper is certain 
to be wasteful. Some completion tests require actual 
scrutiny of pupils’ responses, especially in case the answers 
are unusual and do not appear on the stencil. 

A reference sheet may often be placed in close enough
juxtaposition to the actual test sheets to allow for fairly 
rapid comparison of recorded and acceptable answers. 

X. DECIDING UPON RULES FOR SCORING

Certain aspects of scoring rules have already been touched 
upon in connection with the preceding sections. There is 
little more to add at this time, although Part III of this 
volume re-introduces certain controversial issues related to 
scoring of objective tests. 

A few general statements must suffice for the present. 

1. Avoid giving partial credits. Except in rare cases, 
mark an answer either right or wrong. 

2. Give each test item one point of credit in such tests as 
true-false, simple recall, and multiple-choice. 

3. In completion tests give one point for each blank which 
is correctly filled. 

4. In matching tests give one point for each pair correctly 
matched. 

5. Do not attempt to weight test items for difficulty or
relative importance. Such weightings are quite as likely 
to result in injustice as in justice. 

6. Where chance alone gives the pupil the opportunity of 
getting one-half or one-third of the items correct by pure 
guessing, apply the chance correction formula. In two- 
response tests (including the true-false, yes-no, same- 
opposite, etc.) and in three-response (multiple-choice) 
tests, the scores should be corrected for chance. 

The formula for correcting for chance effects is:

$$\text{Score} = \text{No. Right} - \frac{\text{No. Wrong}}{n-1}$$

or more simply:

$$S = R - \frac{W}{n-1}$$

where:

S is the corrected score
R is the number of right (correct) responses
W is the number of wrong responses
n is the number of possible responses presented to the
pupil for each item.

In two-response tests (including the varieties of the true-false),
this formula¹ becomes: $S = R - W$

For three-response tests, this formula is: $S = R - \frac{W}{2}$

Similar formulas may be derived for four, five, or more
responses, but in practice no correction for chance is
ordinarily employed when more than three responses are
presented, i.e., for four-, five-, or more-response tests. It
should also be noted that with tests of mixed character
(i.e., when certain items present two alternatives, certain
present three choices, and certain others four or more
choices), there can be no method of making allowance for
guessing and chance effects.
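
As a minimal sketch of the correction formula, the Python function below computes S from R, W, and n. The function name `corrected_score` is hypothetical; the tallies used are those of the two worked examples given in this section, and exact fractions are used so that the half-point in the three-response case is not lost to rounding.

```python
from fractions import Fraction


def corrected_score(rights: int, wrongs: int, n_responses: int) -> Fraction:
    """Return S = R - W / (n - 1), the score corrected for chance."""
    if n_responses < 2:
        raise ValueError("each item must offer at least two responses")
    return rights - Fraction(wrongs, n_responses - 1)


# Two-response (true-false) test: S = R - W
two_response = corrected_score(rights=111, wrongs=26, n_responses=2)

# Three-response test: S = R - W/2
three_response = corrected_score(rights=95, wrongs=21, n_responses=3)

print(two_response, float(three_response))
```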

¹The formula $S = A - 2W$, where A represents the number of items attempted, is algebraically
equivalent to $R - W$, and is usually somewhat more convenient to use.

The use of the correction formula for true-false (and
other two-response tests) is illustrated below. 

Method I 

150 (Possible or maximum score) 

13 (Omissions) 

137 (Attempts, i. e., total number of items attempted) 

26 (Wrongs, i. e., total number of items answered incorrectly) 

111 (Rights, i. e., total number of items answered correctly) 

26 (Wrongs) 

85 (Score, i. e., rights minus the wrongs) 

Method II 

It should be noted that the formula A—2W (attempts 
minus two times the number wrong) gives the same result, 
thus: 

137-2(26) = 137-52 = 85 (Score) 
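
The agreement of the two methods follows from the fact that every attempted item is either right or wrong. With A, R, and W standing for attempts, rights, and wrongs as above, a short derivation runs:

```latex
\begin{aligned}
A &= R + W \\
A - 2W &= (R + W) - 2W = R - W \\
137 - 2(26) &= 111 - 26 = 85
\end{aligned}
```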

In explaining to a class the method of scoring two-re- 
sponse tests, the former (and longer) procedure is less likely 
to be misunderstood than is the latter, which appears to 
take off two for each wrong answer. The second procedure 
is likely to be regarded as a double penalizing by pupils. 

Since it is usually desirable to return papers to the class, 
at least for inspection, pupils are certain to raise the question 
of how the score was found. It is worth while to explain 
the method in detail to pupils of high-school age, although 
it is probably wasted effort to attempt to justify the logic 
of the R — W scoring to elementary-school pupils. When the 
issue is raised, try to make the pupils see the reasonableness 
of the right-minus-wrong method of scoring from the stand- 
point of probability. Chapters XI and XII should be read 
in this connection. 

In calculating corrected scores on two- and three-response 
tests, it is necessary to count but two of the three possible 
responses, viz., the rights, the wrongs, and the omissions.
In two- or three-response tests of about the “ideal”¹ degree
of difficulty, the number of rights is likely to exceed both the 
wrongs and omissions. It is somewhat easier, therefore, to 
count only the wrongs and omissions. 

The method of correcting three-response tests for chance
is illustrated by the following example: 

125 (Possible or maximum score) 

9 (Omissions) 

116 (Attempts) 

21 (Wrongs) 

95 (Rights) 

10½ (21 ÷ 2, or one-half of the wrongs)

84½ (Score)

The fraction is usually dropped. 
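
The short-cut of counting only the wrongs and the omissions can be sketched as follows. The function name `score_from_tallies` is hypothetical, and dropping the fraction is taken here to mean discarding the fractional part of the corrected score.

```python
import math
from fractions import Fraction


def score_from_tallies(
    total_items: int, wrongs: int, omissions: int, n_responses: int
) -> int:
    """Corrected score from the two counts actually made by the scorer.

    Rights are found by subtraction, the chance correction W / (n - 1)
    is applied, and any fraction in the result is dropped.
    """
    if n_responses < 2:
        raise ValueError("each item must offer at least two responses")
    if wrongs < 0 or omissions < 0 or wrongs + omissions > total_items:
        raise ValueError("wrongs and omissions cannot exceed the number of items")

    rights = total_items - omissions - wrongs
    corrected = rights - Fraction(wrongs, n_responses - 1)
    return math.trunc(corrected)


# The three-response example above: 125 items, 9 omissions, 21 wrongs.
three_response_score = score_from_tallies(125, wrongs=21, omissions=9, n_responses=3)

# The earlier true-false example: 150 items, 13 omissions, 26 wrongs.
true_false_score = score_from_tallies(150, wrongs=26, omissions=13, n_responses=2)

print(three_response_score, true_false_score)
```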

Four-, five-, or more, response tests are not corrected for 
chance in actual practice. 

¹“Ideal” difficulty is here defined as meaning that the average pupil will earn a corrected
score which is roughly half of the maximum score.



CHAPTER VIII


ILLUSTRATIVE TYPES OF OBJECTIVE TESTS 

Classification of types. Conneau, working under the
direction of Rice and Ruch, analyzed in detail 375 objective
or new-type examinations submitted in competition for
prizes in a national contest for constructing such tests.¹

Table 24 gives a brief summary of the tendencies in ob- 
jective test construction. This is the only extensive sum- 
marization of practice which has appeared to date. The 
conditions (a contest for cash prizes) under which these
examinations were constructed guarantee a standard of 
excellence which is undoubtedly higher than the average 
objective classroom test. 

Table 24 shows clearly that five types of objective tests 
(completion, true-false, multiple-response, matching, and 
identification exercises, and their variates) make up over 
ninety per cent of the 45,418 test items included in 375 
typical examinations. This does not mean that other forms
are unimportant. Each school subject has its peculiar
needs, and individual types of tests may occur frequently in
one subject and never appear in certain others. Examples
of this are to be found in mathematics, where computations
and problems lead all other types, although such items were
seventh in the total lists. Other cases are map location in
history and geography, translation and scansion in languages, 
reproduction of poems, laws, and axioms in English, science, 
and mathematics, etc. 

¹A. Conneau: Tendencies in Objective Testing in High-School Subjects as Shown by
Analysis of a Representative Sampling of Such Tests (19^), Unpublished M. A. Thesis,
University of California.

The best of the 375 tests, together with certain statistical data, are published in G. M.
Ruch and G. A. Rice, Specimen Objective Examinations (Chicago: Scott, Foresman &
Co., 192^). This reference will give valuable suggestions to the teacher who wishes to
compare her own tests with those which have been adjudged as having merit by
competent critics.

TABLE 24

An Analysis of the Principal Types of Objective Tests in Actual
Use Throughout the United States (After Conneau)

Types of Tests                                           Actual      Per
                                                         Numbers     Cents

 1. Completion tests, and variates                        13,492     29.71
 2. True-false, and variates                              10,956     24.12
 3. Multiple-choice, and variates                          7,473     16.45
 4. Matching, association, etc.                            4,845     10.67
 5. Identifications, with or without diagrams              4,165      9.17
 6. Correct form, including re-writing for grammar,
    capitalization, etc.                                   1,254      2.76
 7. Computations and problems                                805      1.77
 8. Rearrangements, mixed sentences, etc.                    803      1.77
 9. Translations, in foreign languages                       347      0.76
10. Reproduction from memory, poems, axioms, etc.            347      0.76
11. Essay questions, short paragraphs                        274      0.60
12. Map locations                                            264      0.58
13. Analogies                                                201      0.44
14. Constructions, with figures or diagrams                   74      0.16
15. Deductions, of conclusions or principles from
    stated premises                                           55      0.12
16. Redundancies or cross-outs                                45      0.10
17. Pronunciation, scansion, etc.                             18      0.04

    Totals                                                45,418     99.98

No. of examinations analyzed: 375
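
The arithmetic of Table 24 is easily checked. The sketch below, whose variable names are merely illustrative, re-enters the counts from the table, totals them, and computes the share held by the first five types.

```python
# Item counts copied from Table 24, in the order given there.
TABLE_24 = [
    ("Completion", 13492),
    ("True-false", 10956),
    ("Multiple-choice", 7473),
    ("Matching", 4845),
    ("Identifications", 4165),
    ("Correct form", 1254),
    ("Computations and problems", 805),
    ("Rearrangements", 803),
    ("Translations", 347),
    ("Reproduction from memory", 347),
    ("Essay questions", 274),
    ("Map locations", 264),
    ("Analogies", 201),
    ("Constructions", 74),
    ("Deductions", 55),
    ("Redundancies", 45),
    ("Pronunciation, scansion", 18),
]

total_items = sum(count for _, count in TABLE_24)
top_five_items = sum(count for _, count in TABLE_24[:5])
top_five_share = 100 * top_five_items / total_items

for name, count in TABLE_24:
    # Per cent of all items, rounded to two places as in the table.
    print(f"{name:<28}{count:>7}{100 * count / total_items:>8.2f}")
print(f"Total items: {total_items}; first five types: {top_five_share:.2f} per cent")
```

The per cents sum to 99.98 rather than 100 only because each entry is rounded to two places.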




As a working classification, the following outline of test 
types is given, the classification being based on the principal
usages found by analysis of the 375 examinations: 

I. Recall types

(A) Simple-recall 

(B) Completion 

(C) Short-answer 
II. True-false types 

(A) True-false (also: + - and + 0)

(B) Yes-no 

(C) Right-wrong 

(D) True-false-doesn’t-say (also: true-false-don’t-know
and true-false-can’t-tell, etc.)

(E) Converse true-false (in mathematics only)

(F) True-false with diagrams 

(G) Synonym-antonym

III. Multiple-response (or multiple-choice) types 

(A) Multiple-response, proper (subdivided accord- 
ing to the number of responses presented, for 
example: two-response, three-response, four- 
response, etc.) 

(B) Best-answer 

IV. Matching exercises 

(A) Perfect pairing 

(B) Imperfect pairing 

(C) Multiple matching 

V. Analogies 

VI. Rearrangement types

(A) Chronologies 

(B) Order of operation 

(C) Mixed sentences 

VII. Computations 

(A) Examples 

(B) Problems 

VIII. Constructions 

(A) Mathematical figures 

(B) Diagrams (as in science) 

IX. Identifications 

(A) With drawings 
(B) Without drawings

X. Reproductions from memory (poems, axioms, laws,
formulas, symbols, equations, etc.) 

XI. Correction of errors (in grammar, punctuation, 
capitalization, spelling, etc.) 

XII. Redundancies or cross-outs 

XIII. Map location 

XIV. Deduction of conclusions from premises 

XV. Translations 

XVI. Miscellaneous and mixed types

ILLUSTRATIONS OF VARIOUS TYPES

The following pages give a number of test fragments 
illustrating the preceding sixteen types of exercises. These 
samples are numbered consecutively for convenience of 
reference. The types are designated by the same numbers 
and letters as were used in the foregoing classification. These 
samples are unedited. Occasional mechanical changes 
have been made in the interests of varying the forms. 
These test fragments are to be viewed as illustrations of 
techniques rather than content. The content, however, is in 
every case some teacher’s judgment of worthwhile material. 

I. RECALL TYPES

IA. Simple Recall

Sample 1. (Geography)

After each state in this list, write its largest city:

Indiana ________          Illinois ________

Washington ________       Colorado ________

Missouri ________         Iowa ________

Oregon ________           Alabama ________

Sample 2. (Business arithmetic)

1. Telephone lines with more than one phone are called ________ lines.

2. An ________ is an itemized statement of goods sold, and is
sent when the goods are shipped.

3. The one on whom a draft is drawn is called the ________

Sample 3. (English)

1. Women first acted in plays during the ________ century.

2. ________ wrote novels in letter form.

3. ________ burlesques the typical 18th and 19th century heroine.

Sample 4. (Literature)

1. Snowbound was written by ________

2. The god who held up the heavens was ________

3. The best known work of Coleridge is ________

4. “The shrub was like a sheeted specter” is an
example of ________

5. Pilgrim's Progress is an example of the type of
writing known as ________

Sample 5. (Poetry)

1. “God’s in his heaven:
All’s right with the ________!” - Browning

2. “I, the ________ of all the ages
In the foremost files of time.” - Tennyson

3. “________, rest! thy warfare o’er
Sleep the sleep that knows not breaking.” - Scott

Sample 6. (Spanish)

Write on the dotted line the form of the present tense which corresponds
to the subject pronoun.

1. comprar ........ yo          3. abrir ........ nosotros

2. hacer ........ él            4. Etc.

Sample 7. (Mechanical Drawing)

1. A drawing of a complete machine is called a(n) ________ drawing.

2. In machine drawing, perspective has ________ value.

3. To put threads in a hole is called ________

4. Small screws designated by number instead of diameter are called ________


IB. Completion 

Sample 8. (Physiology)

The human body is composed of small divisions known as ________
although ordinarily we cannot ________ these small divisions
without the aid of a ________

Sample 9. (Typewriting)

1. The spacing from the letterhead to the date ranges from ________
to ________ spaces, depending on the ________
of the letter. The spacing from the date line to the inside
address is the ________ as from the ________
to the ________ or a ________ of ________
to ________ spaces.

2. The letters “i” and ________ are struck with the ________
finger of the ________ hand.


Sample 10. (English)

1. A tragedy is the portrayal of ________
which is bound to end ________ because of ________

2. The first act of a play has three functions: ________, ________
and ________

3. At the beginning of the story the Primrose family lived in
________, where Mr. Primrose was ________ Their
greatest friends were the ________ with whom they were
particularly intimate, because ________ This friendship
was broken up when ________ Immediately after this misfortune
George Primrose went ________ Later on, like the
author Goldsmith, he ________

Sample 11. (Cooking)

1. In making cream of tomato soup, ________ may be
added to the ________ to neutralize the ________

2. There are ________ tsp. in one tbsp., ________ tbsp. in one
cup, and ________ c. in one qt.

3. Sugar digests ________ and for this
reason irritates ________



Sample 12. (Geometry)

1. A surface has ________ dimensions; ________ and ________

2. A line has ________ dimension; ________

3. A point has ________ dimension; its one property is ________

4. Triangles are classified with regard to angles as ________, ________
and ________ triangles. They are classified
with regard to the length of the sides as ________, ________
and ________ triangles.

5. If the sum of two angles is a straight angle, they are called
________ angles.

Sample 13. (History)

A Greek ship loaded with silk from the Orient comes north along the
Ionian cities of Asia Minor, through the strait of ________
to the European seaport ________ at the entrance to the
Black Sea. Here the silk is traded for ________, a chief
export of that region. The loaded ship returns through the strait to the
________ Sea, across this sea to Piraeus, the seaport of
________ in the district of ________ Here
the cargo is again exchanged for ________ an important
export of this district. Etc.

IC. Short-Answer Types¹

Sample 14. (English)

Explain in one sentence what connection each of the following people
had with Goldsmith:

1. Griffiths                  5. Newberry

2. Dr. Milner                 6. Contarine

3. Sir Joshua Reynolds        7. Dr. Johnson

4. Jessamy Bride              8. Garrick

Sample 15. (Latin)

Identify each of the following in a single sentence.

1. Galba   2. Pompey   3. Crassus   4. Dumnorix

Sample 16. (Geometry)

Write a clear brief statement giving the best reasons you know why each
statement is true.

1. A diagonal divides a parallelogram into two equal triangles.

2. All equilateral triangles are similar.

3. A diameter is the greatest chord that can be drawn in a circle.

4. Every triangle may be inscribed in a circle.

¹The term “short-answer” has been used by some writers as a synonym for objective
or new-type tests, etc. It is used here to mean any not-too-long answer of at least moderate
objectivity of scoring.

II. TRUE-FALSE TYPES

IIA. True-False

Sample 17. (Physiology)

Underline the “true” or “false” according to your judgment of the truth
of each statement.

1. Tetanus (lockjaw) germs usually enter the body through
open wounds. True False

2. Pneumonia causes more deaths in the United States than
tuberculosis. True False

3. White blood corpuscles are more numerous than are
the red ones. True False

Sample 18. (History)

Draw a circle around the “T” or the “F” depending upon whether the
statement is true or false.

1. Lincoln's first purpose in entering upon the Civil War
was to free the slaves. T F

2. Pinchot headed the food conservation program of the
U. S. during the World War. T F

3. The Dred Scott decision was concerned with the question
of free silver. T F
Sample 19. (Shorthand) To be marked + or -.

1. R, L, P, and B are all downward letters.

2. The circle vowel is written on the inside of curves.

3. The vowels in mean, she, and day are joined according to the
same rule.

Sample 20. (English) To be marked + or 0.

1. The story of Beowulf lets us know something of the Anglo-Saxon
ideals.

2. Beowulf was without doubt written during the time of Chaucer.

3. Little worth-while drama was produced during the Elizabethan
age.

4. As a whole Puritan literature lacked romantic ardor.

5. Satire was a prominent element in the literature of the classic
period.

Sample 21. (Spanish)

1. Viva en una casa. True False

2. El padre de mi padre es mi abuelo. True False

3. Mi escuela está en Nueva York. True False

4. El dinero no le gusta a nadie. True False

5. Hace calor en el verano. True False

Sample 22. (Manual Training)

1. You can tell a rip-saw from a cross-cut saw by the size of the
teeth, the rip-saw teeth being larger.

2. No. 0 sandpaper is smoother than No. 1 sandpaper.

3. A six-penny nail is longer than a 2Y screw.

4. Shellac is made from the dry sap of an oriental tree.

IIB. Yes-No Type

Sample 23. (Cooking)

1. Is fish higher in protein content than beef? YES NO

2. Are deep-fried foods harder to digest than those fried
in a small amount of fat? YES NO

3. Is gelatin a pure protein food?¹ YES NO

¹Some teachers prefer the question form as being less likely to “fix” false notions in the
minds of the pupils. (It may be questioned whether the interrogative form really helps.)

Sample 24. (Geography)

1. Did steam improve the transportation of the 17th
century? Yes No

2. Does dairy farmland cost less than grazing land? Yes No

3. Does Oregon produce more lumber than Texas? Yes No

Sample 25. (Latin)

1. Is “populus Romanus” used in the plural?

2. Is “littera” used in the singular when it means an epistle? 

3. Is “milia” always followed by the genitive of the things 

enumerated? 

4. Is “neutri pueri” the correct Latin translation for the phrase 
“neither boy”? 

Sample 26. (Manual Training)

1. Spiral reamer flutes turn opposite from those of a drill. YES NO

2. The lands on a rose reamer are relieved. YES NO

3. Unequal lips on a drill cause oblong holes. YES NO

4. The web of a drill lies between the two cutting edges. YES NO

5. The flank of a tooth is below the pitch circle. YES NO

IIC. Right-Wrong Type

Sample 27. (Sentence Structure)

1. Glancing down the famous street, signs of every kind
were visible. Right Wrong

2. Hills that have witnessed from the time of the first
inhabitants all the exciting events that form the glorious
past of the city and the Spanish conquests in search of gold,
the craving which drew the first American settlers to this
country. Right Wrong

3. Looking down from my high position, I saw, as night
came stealing over the brae, a flickering of candlelight in
the windows. Right Wrong

Sample 28. (Punctuation and Capitalization)

Check the correct ones with an “R” and the incorrect ones with a “W.”

1. I had never heard anyone sing “Where is My Wandering Boy
Tonight?”.

2. With a condescending air, she handed the biggest package to
Elmer, my escort, and ordered him to carry it for her.

3. She pretends to be very intellectual. One of her first gestures
was to ask me if I did not think Paradise Lost more interesting than Nize
Baby?

IID. True-False, Didn't-Say 
True-False, Don’t-Know 
True-False, Can't-Tell 

Sample 29. (Geometry)

Mark the following statements “T,” “F,” or “D” according to whether
they are always true, always false, or sometimes true (doubtful).

1. Two triangles are congruent if:

(a) Three sides of one triangle are respectively equal to
the sides of the other triangle. T F D

(b) Three angles of one triangle are respectively equal to
three angles of the other triangle. T F D

(c) Two sides and the included angle of one triangle are
respectively equal to the two sides and the included
angle of the other triangle. T F D

IIE. Converse True-False Type

Sample 30. (Geometry)

If the converse of each of the following statements is true, underline the
words “converse-true.” If the converse is not true, underline the words
“converse-false.”

1. In the same circle or in equal circles equal chords subtend equal
arcs. Converse-true Converse-false

2. If a line divides two sides of a triangle proportionately, it is parallel
to the third side. Converse-true Converse-false

3. Congruent figures are necessarily equal in area.
Converse-true Converse-false

IIF. True-False With Diagrams

Sample 31. (Geometry)

[Diagram: lines AB and RS with marked angles.]

1. Are AB and RS parallel? YES NO

[Diagram: lines AB and RS with marked angles.]

2. Are AB and RS parallel? YES NO

IIG. Synonym-Antonym

Sample 32. (Reading Vocabulary)

If the words of a pair mean the same or nearly the same, draw a line under
same. If they mean the opposite or nearly the opposite, draw a line under
opposite.

1. cold-hot same-opposite 

2. knave-villain same-opposite 

3. recoup-recover same-opposite 

4. plenary-complete same-opposite 

5. adventitious-accidental same-opposite 


III. MULTIPLE-RESPONSE (MULTIPLE-CHOICE) TYPES

IIIA. Two-Response Type

Sample 33. (History)

1. The first president of the Confederacy was Lee Davis

2. The turning point of the Civil War is usually taken as the battle of
Bull Run Gettysburg

3. The Dred Scott Decision was concerned with slavery tariff

Sample 34. (Language)

1. The children (sung, sang) several songs. 

2. I thought he (did, done) it. 

3. Since then I have never (run, ran) away. 

Sample 35. (Language)

1. I ^ the man today. 

2. The tardy bell ^ sometime ago. 

3. She sang (beautiful, beautifully).

Sample 36. (Latin)

Strike out the incorrect form.

1. (Domi) (Domum) est.

2. Cur pontem (feceris) (fecisses) scio.

3. Caesar erat (dux bonus) (ducem bonum).

IIIA. Three-Response Type

Sample 37. (Commercial geography)

1. A natural building stone is: cement-granite-tile

2. The Columbia River is noted for: cod-sardines-salmon

3. Most automobiles are made in: Michigan-Ohio-New York

Sample 38. (Latin)

1. The town was not like the city.

Oppidum (urbi, urbe, urbem) simile non erat.

2. Many of the Helvetians had been wounded.

Multi (Helvetiis, Helvetios, Helvetiorum) vulnerati erant.

IIIA. Four-Response Type

Sample 39. (Physiology)

1. The normal pulse rate is about 48 70 98 112 

2. The trunk is divided into two main cavities by the ribs diaphragm 
oesophagus vertebrae 

3. The absorptive action of the small intestine is greatly increased by 
the villi pylorus pancreas spleen 

IIIA. Five-Response Type
Sample 40. (General Science)

1. The freezing point on the Centigrade thermometer is −273° 0°
32° 100° 212°

2. A gas which supports combustion is hydrogen nitrogen carbon 
dioxide oxygen carbon monoxide 

Sample 41. (History)

Write the number of the correct answer on the line at the right.

1. Peter the Great was a ruler of (1) England (2) Holland 

(3) Russia (4) Gaul (5) Denmark 

2. The first state to secede from the Union was (1) Virginia 

(2) South Carolina (3) Delaware (4) Missouri (5) North 
Carolina 

3. The minimum age for a voter is (1) 18 (2) 19 (3) 20 

(4) 21 (5) 25 

IIIB. Best-Answer Type

Sample 42. (Biology)

1. Leguminous plants play an important role in nature because: 
Bacteria associated with their roots return nitrogen to the soil. 
They will grow on soil too poor to support other crops. 

The economic value of the hay crop is very large. 

2. The best of these definitions of photosynthesis is: 

The action of sunlight on plants. 

The process of food manufacture in green plants.

The process by which plants give off oxygen. 
Sample 43. (Mathematics)

1. Adjacent angles 
are always equal. 

always have a common side and vertex. 

if added together make 90 degrees. 

2. An angle of 30° is vertical to an angle

of 30° if they have a vertex in common; the left side of the one

and the right side of the other form a straight line.

of 60° if the left side of one and the right side of the other,

and vice versa, form a straight line.

of 30° if they have a common side and a common vertex.

IV. MATCHING EXERCISES 

IVA. Perfect Matching 

Sample 44. (English Literature)

AUTHORS                      ANSWERS   WRITINGS
 1. Oliver Goldsmith         ..5...    David Copperfield
 2. Jane Austen              ......    Life of Johnson
 3. George Eliot             ......    Henry Esmond
 4. Matthew Arnold           ......    Tam O'Shanter
 5. Charles Dickens          ......    Pamela
 6. Lord Byron               ......    The Prisoner of Chillon
 7. Samuel Richardson        ......    Ode to the West Wind
 8. Robert Burns             ......    Sohrab and Rustum
 9. William M. Thackeray     ......    Locksley Hall
10. James Boswell            ......    Mill on the Floss
11. John Ruskin              ......    Lays of Ancient Rome
12. John Keats               ......    Pride and Prejudice
13. T. B. Macaulay           ......    Eve of St. Agnes
14. Alfred Tennyson          ......    The Deserted Village
15. P. B. Shelley            ......    Modern Painters

Sample 45. (Manual Training)

Match each style of furniture with its characteristic.

1. Mission        Maple or birch; a turned job
2. Windsor        Massive and plain
3. Louis XV       Elaborate with delicate carvings
4. Jacobean       Delicate and graceful
5. Chippendale    Spiral turnings; usually finished in “antique oak”

Upon what part of a building as given in Column Two does each part of
Column One rest?

1. Floor boards      sills
2. Rafters           rafters
3. Roof sheathing    plates
4. Joists            joists


IVB. Imperfect Matching

Sample 46. (History)

MEN
 1. Thomas H. Benton
 2. Thaddeus Stevens
 3. George B. McClellan
 4. Carl Schurz
 5. Thomas Jefferson
 6. Miles Standish
 7. De Witt Clinton
 8. Charles Sumner
 9. Sir Walter Raleigh
10. Christopher Columbus
11. Vasco de Balboa
12. David Wilmot
13. Woodrow Wilson
14. William Bradford
15. Richard Hoe

CHARACTERIZING PHRASE
( 5) Wrote the Declaration of Independence
(  ) For 30 years a senator from Missouri
(  ) An immigrant who worked for political reform
(  ) Leader of Union Army in Peninsula Campaign
(  ) Congressman demanding harsh treatment of the South
(  ) Discoverer of the New World for Spain
(  ) Spent a fortune to found a colony in America
(  ) Military man of Plymouth; celebrated by Longfellow
(  ) Massachusetts senator who denounced the “Crime Against Kansas”
(  ) Governor of New York; promoted the Erie Canal


IVC. Multiple Matching
Sample 47. (Literature)

Write on the lines after each character the words (from the column at
the right) which best fit that character. Each word may be used more
than once.

1. Gawain     (a) ........      1. fickle
              (b) ........      2. unhappy
              (c) ........      3. powerful
2. Arthur     (a) ........      4. idealistic
              (b) ........      5. untrustworthy
              (c) ........      6. of great purpose
3. Lancelot   (a) ........      7. a gossiper
              (b) ........      8. disgusted with himself
              (c) ........      9. courteous
                               10. rude (Etc.)
V. ANALOGIES

Sample 48. (Geometry)

1. Two points : a straight line :: ______ : a plane

2. A triangular face : ______ :: a rectangular face : ______

3. Vertex : plane angle :: edge : ______

Sample 49. (Algebra)

1. Exponent : a number :: index : (number, coefficient, exponent)

2. x : 3x :: 6 : (99, 20, 27, 12, 18, 30, 6)

3. Monomial : binomial :: binomial : (monomial, binomial, trinomial)

Sample 50. (Ancient History)

1. The Book of the Dead was to the Egyptians as the ______
was to the Persians, and as the ______ was to the Hebrews,
and as the ______ was to the Mohammedans.

2. Zeus was to the Greeks as ______ was to the Romans.
Mercury was to the Romans as ______ was to the Greeks.
Demeter was to the Greeks as ______ was to the Romans.
Athena was to the Greeks as ______ was to the Romans.

3. Enlil was to the Babylonians as ______ was to the Persians,
and as ______ was to the Hebrews.

VI. REARRANGEMENT TYPES

VIA. Chronologies

Sample 51. (History)

Arrange these “issues” according to the order of their appearance in
American political history.

( ) Reduction of the surtax 
( ) Free coinage of silver 
( ) Internal improvements at national expense 
( ) “54-40 or fight” 

( ) Entering the League of Nations 

Sample 52. (English Literature)

Re-arrange the following events from the first two books of the Aeneid
in the order in which Vergil tells about them.

The banquet in Dido's palace 
The struggle in the palace of Priam 
The death of Laocoon 
Venus tells Aeneas the story of Dido 
The storm off the coast of Sicily

The vision of Hector appears to Aeneas 

The struggle with the band of Greeks led by Androgeus 

VIB. Order of Operations
Sample 53. (Manual Arts)

1. In starting an automobile, what is the order in which the following
things should be done? (Number 1, 2, 3, etc.)

cranking or stepping on the starter 

turning on the ignition 

retarding spark 

choking 

putting in neutral 

2. In glazing a window, what is the proper order for these jobs? (Num- 
ber 1, 2, 3, etc.) 

cutting glass to size 

putting on thick putty 

putting glass in place

putting on thin putty 

cleaning out old putty 

painting rabbet with linseed oil 

driving in glazier points or brads 

Sample 54. (Cooking) 

The following are the steps in making muffins. Indicate by 1, 2, 3, etc.,
the order in which you would perform the steps. 

bake 

measure and sift ingredients (dry) 

add melted fat 

add egg 

assemble utensils and ingredients 

add liquids 

place in muffin tins 

VIC. Mixed Sentences 

Sample 55. (Latin) 

Number the words in each sentence 1, 2, 3, etc., to show the correct
word-order.

1. Insulae habent multas formas.

2. Equi trahunt carros.

3. Romani vias bonas multas muniebant.

4. Amiserunt in multi bello vitam. 
VII. COMPUTATIONS

VIIA. Examples

Sample 56. (Algebra)

1. Add: — 3x, and 6x

2. What does (3a − 2a² − 1) plus (a² − 1) plus (3a² + 2a) equal?

3. Multiply (2x + y) by (x − y)

VIIB. Problems

Sample 57. (Algebra)

1. A rectangular field is y feet wide and 40 feet long. What will represent 
its perimeter? 

2. A rectangle is three times as long as it is wide. If each dimension is 
increased by 4 inches, the rectangle will be twice as long as it is wide. 
Find its length and width. 

Sample 58. (Chemistry)

1. How many liters of sulphur dioxide at standard conditions will be
obtained when 52 grams of sodium acid sulphate completely react with
hydrochloric acid? Atomic weights:

Sodium =23 Hydrogen =1 Sulphur =32 Oxygen = 16 

2. A compound contains 70% iron (at. wt., 56) and 30% oxygen (at. 
wt., 16). Find the simplest formula. 

VIII. CONSTRUCTIONS

VIIIA. Mathematical Figures
Sample 59. (Arithmetic)

1. Draw a line through point A so as to form an angle of 45 degrees with
line AB.    A ________ B

2. Construct a right triangle whose sides are, respectively, 1½, 2, and
2½ inches.

3. Through C, draw a perpendicular to AB.    A ____C____ B

VIIIB. Science Diagrams


Sample 60. (Botany)

1. Draw a cross-section plan of the flower of the buttercup.

2. Draw a longitudinal section of a grain of corn.

3. Show the arrangement of the F. V. B. in the stem of an endogen.
IX. IDENTIFICATIONS

IXA. With Drawings

Sample 61. (Zoology)

Write the name of each structure which bears a number.

1 

2 

3 

5 

6 

7 

8 

IXB. Without Drawings 



Sample 62. (Chemistry)

In the following list, identify by encircling E, C, M, X, respectively,
the elements, compounds, mixtures, and any other classification.

 1. water gas       E   C   M   X
 2. lamp black      E   C   M   X
 3. alloy           E   C   M   X
 4. ammonia         E   C   M   X
 5. air             E   C   M   X
 6. alum            E   C   M   X
 7. brass           E   C   M   X
 8. caustic soda    E   C   M   X
 9. radium          E   C   M   X
10. steam           E   C   M   X   Etc.


Sample 63. (German)

Identify each of the following in a short sentence in German.

1. Brigitte

2. Die weisse Taube

3. Das Zithermädchen

Sample 64. (English)

Identify each of the following by naming the story in which it occurs. 

1. The pillar of fire by night 

2. The Sea Maid 

3. The man who wanted to write a story, but was laughed at by his wife 
X. REPRODUCTIONS

Sample 65. (English)

Quote ten lines from “As You Like It.”

Sample 66. (Chemistry)

CHEMICAL NAME FORMULA 

1. Sodium phosphate 

2. Sulphuric acid 

3. Ammonium bromide 

4. Calcium carbonate 

XI. CORRECTION OF ERRORS 
Sample 67. (Grammar)

Draw a line through each word which is unnecessary or incorrectly used. 

1. I feel pretty good. 

2. Who did you see? 

3. They don't know nothing. 

4. He jumped off of the car. 

5. Him and I were unable to go. 

Sample 68. (Grammar)

Draw a circle around any error or omission in spelling, punctuation, 
capitalization, or grammar in the following sentences. The first three are 
marked correctly as samples. If the sentence contains no errors, write 
“Correct” on the dotted line at the right. Correct each error found by 
rewriting the correct form on the dotted lines. 

1. I saw him yesterday. 

2. The meeting was called by Jones. 

3. Who is that man○ ...?...

4. Busness promises to improve. 

5. What a pity!, she exclaimed. 

6. The tardy bell has rung. 

7. There goes president Coolidge. 

8. You the leader, should go first. 

9. I am just wild about classical music. 

10. He don't appear to be very intelligent. 

XII. REDUNDANCIES

Sample 69. (Geometry)

Place, in the space to the left of each of the following statements in proof
of the theorem stated, the letter “N” if the statement is a necessary step
in the proof of the theorem and the letter “U” if the statement is
unnecessary to the proof. (Also correct any errors that you find.)



THEOREM: The sum of the three angles of a triangle is equal to two right angles.
GIVEN: Triangle ABC
TO PROVE: Angle n plus angle s plus angle BCA is equal to two right angles.

STATEMENTS                                    REASONS

.... 1. Produce AC through C to D.            1. A straight line may be drawn connecting
                                                 any two points.

.... 2. From C draw CE bisecting angle        2. To draw a line parallel to a given line.
        BCD.

.... 3. Angle s plus angle n plus angle       3. The sum of all the angles about a point on
        BCA equals two right angles.             the same side of a straight line through
                                                 that point is a right angle.

.... 4. Angle s equals angle B.               4. If two lines are cut by a transversal, the
                                                 alternate interior angles are equal.

.... 5. Angle n equals angle B.               5. If two parallel lines are cut by a
                                                 transversal, any corresponding angles are
                                                 adjacent.

.... 6. But angle s equals angle n.           6. Construction.

.... 7. Therefore angle A equals angle        7. Things equal to the same thing are equal
        B.                                       to each other.

.... 8. Substitute in (3) above angle s       8. A quantity may be substituted for its
        for its equal angle A and angle n        equal in any process.
        for its equal angle B.

A plus B plus BCA equal two right angles.
Sample 70. (English)

Cross out the words that make the statements incorrect.

1. Poe’s literary work is remarkable for its artistic finish, realism, sad- 
ness, moral ideas, and special technique. 

2. Mark Twain is best remembered for his hatred of hypocrisy, refined 
humor, romantic history of western life, long detailed descriptions, and 
strong sense of justice. 

Sample 71. (Reading Comprehension)

Cross out the word or words that spoil the sense of the sentence. 

1. It was a very hot day, and I went at once into the house and put on 
my fur overcoat. 

2. She was a beautiful girl with long curls, blue eyes, a well-shaped nose,
and very even yellow teeth.

3. The man stood with his hands in his pockets as he pointed out to me 
the road to Boston. 

XIII. MAP LOCATION

Sample 72. (History)

Study the map below carefully. Notice that the towns and cities shown
are numbered. Write as your answer the numbers of the cities, in order,
through which Sherman passed on his famous “March.”

Your answer should include seven numbers. 
Sample 73. (Geometry)

On the right below are a number of axioms; on the left are some figures 
and some statements of equality or inequality. Study the figure, the state- 
ments, and the questions. Then select the axiom from the list at the right 
which gives the answer to the question after each figure or statement on 
the left, and write within the parentheses the letter representing the axiom 
which correctly answers the question. 


A    B    C    D
Given: AC = BD
Why does AB = CD? ( )

A    B    C    D
Given: AB > BC and BC > CD
Why is AB > CD? ( )

A    B    C    D
Given: AB > BC and BC > CD
Why is AC > BD? ( )

M    N    O    P
Given: MN = OP
Why does MO ( )
Etc.

(A) If equals are added to unequals in the same order, the sums are unequal
in the same order.

(B) If equals are divided by equals, the quotients are equal.

(C) If equals are added to equals, the sums are equal.

(D) If unequals are subtracted from equals, the remainders are unequal in
the reverse order.

(E) If equals are subtracted from equals, the remainders are equal.

(F) Halves of equals are equal.

(G) Like powers or like roots of equals are equal.

(H) If of three quantities, the first is greater than the second and the
second is greater than the third, then the first is greater than the third.


XV. TRANSLATIONS

Sample 74. (Latin)

Write the correct Latin translation below the English sentence.

1. The woman whom you see is the queen.

2. His troops will fight as bravely as possible.


3. The wretched king blamed himself. 
Sample 75. (German)

Der Wolf und das Lamm

Ein Wolf und ein Lamm standen am Ufer eines [...], um zu
trinken. Oben stand der Wolf, unten das Lamm. Kaum hatte der
Wolf das Lamm gesehen, so fing er gleich einen Streit an. „Warum
trübst du mir das Wasser?“ schrie er wütend. Zitternd antwortete das
Lamm: „Wie kann ich dir das Wasser trüben? Es [...] ja von dir zu
mir herab.“ Der Wolf schämte sich; denn das Lamm hatte die
Wahrheit gesprochen. Dennoch sagte er zornig: „Vor sieben Monaten hast du
mich geschmäht.“ Sanft erwiderte das Schäfchen: „Vor sieben Monaten
war ich ja noch gar nicht geboren.“ „Dann hat es dein Vater
getan,“ rief der Wolf wütend, ergriff das Lämmchen und fraß es.

Read the German story and then answer the questions in English.

1. Where did the wolf and the lamb stand? 

2. Of what did the wolf accuse the lamb? 

3. Why was it impossible for the lamb to have done what the wolf blamed
it for?

4. What happened to the lamb at the end?


XVI. MISCELLANEOUS AND MIXED TYPES 

Sample 76. (Punctuation)

Punctuate the short paragraphs below.

1. Hello said the little old gentleman thats not the way to answer the
door. Im wet let me in.

2. I beg pardon sir said Gluck Im very sorry but I really cant do it 
because my brothers would beat me to death sir. What do you want. 

Sample 77. (Appreciation of Poetry)

Directions: Mark the five (5) of the following ten passages which you 
feel sure would be considered as having the most beautiful language. 
Make your decisions in terms of your own feeling. 

1. Her mother died when she was young 

Which gave her cause to make great moan; 

Her father married the worst woman 
That ever lived in Christendom. 
.2. A host of Phantom listeners
That dwelt in the lone house then 
Stood listening in the quiet of the moonlight 
To that voice from the world of men. 

.3. A damsel with a dulcimer 
In a vision once I saw; 

It was an Abyssinian maid 
And on her dulcimer she played 
Singing of Mount Abora. 

.4. Pack clouds away and welcome day; 

With night we banish sorrow; 

Sweet air, blow soft, mount lark aloft. 

To give my love good-morrow. 

.5. The flowers do fade and wanton fields
To wayward winter reckoning yields; 

A honey tongue, a heart of gall 
In fancy’s spring but sorrow’s fall 


10. Death, be not proud, though some have called thee 

Mighty and dreadful, for thou art not so; 

For those whom thou think’st thou dost overthrow
Die not, poor death. 

Sample 78. (Latin)

Identify by giving the Latin word from which the underscored word is 
derived. 

1. Mr. Smith is the senior member of the firm. 

2. Mary’s hand was so small that she could not 

stretch an octave. 

3. His vision is very bad. 

Sample 79. (Latin)

Opposite each statement give the Latin construction illustrated by the
word or words underlined.

1. Caesar was a man of great courage.

2. No one trusts those barbarians.

3. Labienus is in command of the legion.

4. The scout said that the city had been found.
Sample 80. (Science)

Cross out the word that does not belong with the others.

1. lungs breathe respiration cough swim

2. fire water burn blaze heat

3. run stand march walk skip 

4. beef mutton fish pork veal 

5. stomach intestines mouth ear throat 



CHAPTER IX 

SELECTED COMPLETE EXAMINATIONS 

Introduction. Chapter VIII presented eighty short
extracts representing the commonest types of objective test
techniques, as found by analysis of nearly four hundred
tests and examinations.¹ The present chapter will show a
number of complete or semi-complete examinations selected
with the following ideas in mind:

1. To show long or reasonably long tests which represent 
extended sampling. 

2. To show a wide variety of test procedures. For this 
reason certain fields normally of not much interest to 
elementary- and high-school teachers are represented. 

3. To show tests developed by classroom teachers, 
research agencies, official examining bodies, professional 
test-makers, etc. 

4. To show approaches to the measurement of certain 
kinds of subject-matter ordinarily thought of as not being 
amenable to objective measurement. 

5. To show methodology of test construction rather
than to illustrate “ideal” content, although the content
of many of the examinations represents a high level as well.

The presentation of a large number of complete exam- 
inations is very space consuming. The reader who wishes 
to examine other typical examinations and tests will do well 
to study the volume mentioned in the footnote below and 
certain of the references listed in the General Bibliography. 

In some instances these examinations contain faulty
items, often in violation of principles laid down in this
text. The author did not feel at liberty to make more
than minor changes.

¹Submitted in a national contest for twenty-five cash prizes. Thirty-five of the best
of these examinations, including all the prize-winning tests, are published in G. M. Ruch
and G. A. Rice, Specimen Objective Examinations (Chicago: Scott, Foresman and Company,
1929). This volume is planned as a companion book to the present treatment.

EXAMINATION I 

The following series of elementary-school subject-matter
tests was constructed in the office of the Superintendent of
Schools of Kern County, California, under the direction of
former Superintendent L. E. Chenoweth and present
Superintendent H. L. Healy. Kern County has for several
years prepared quarterly examinations similar to the one
reproduced here for the sixth grade. Each grade has a
different examination.²

²After this chapter was written, a volume has appeared which gives a detailed account
of a similar series of examinations in Lewis County, New York State. See J. S. Orleans and
G. A. Sealy, Objective Tests (Yonkers-on-Hudson: World Book Company, 1928), pp.
x+373. This book, although not a complete treatment of the theory of examination
construction, is an exceedingly valuable addition to the practice of objective testing.

KERN COUNTY BOARD OF
EDUCATION REVIEW
1928

Let us see how well you have learned some
things during the first two quarters. First of
all, fill in these blanks, writing very plainly:

Name ........ Boy or Girl ........

Age ........ Grade ........ Teacher ........

School ........ Date ........

6

SCORE
Arithmetic ........
Language ........
Reading ........
Geography ........
History ........
Music ........
Morals & Man. ........
Phys. Ed. ........
Spelling ........
Writing ........
TOTAL ........

Now do just what the printing tells you to
do. There are many questions to be answered.
Perhaps you will not be able to answer all of
them. That makes no difference. Just do the
best you can. Do not go too fast, and try not
to make any mistakes. If you come to a
question you do not understand, go on to the
next one, and come back to the hard one later on if you have time. There
are three kinds of questions. Read each question. If a statement is TRUE,
put an (X) in the parentheses. If it is FALSE, put a (—) in the
parentheses.

Here is a sample question. Try it for practice. Is the statement true? 


Christmas comes in June every other year. ( ) 

It is false; so the answer should be ( — ). 

Here is another kind of question: 

Columbus discovered— ( ) 


You should write the word America on the line between the parentheses. 

Here is another kind of question: 

Which floats best on the water? 1. iron; 2. stone; 3. wood; 4. steel;
5. brick. ( )

The right word is “wood” and a line should be drawn under that word and
the number “3” put in the parentheses.

REMEMBER always to put your answer in the parentheses. 

Now you are ready to start when the word is given. Do the best you
can. You will have one hour in which to finish PART I. Then rest for a
while and take another hour for PART II. Do all you can the best you
can. You may use scratch paper, if you need it, for any of the Arithmetic
work.

PART I 

ARITHMETIC. 6

1. [illegible] is in lowest terms ( )

2. The sum of [illegible] plus [illegible] equals [illegible] ( )

3. The sum of .08 plus .009 equals .089 ( )

4. The difference between .20 and .07 equals .13 ( )

5. [illegible] equals 25% ( )

6. The product of [illegible] times 8 equals [illegible] ( )

7. 25% of 40 equals 10 ( )

8. 280 is divisible by 3 without remainder ( )

9. The difference between [illegible] and [illegible] equals [illegible] ( )

10. [illegible] divided by [illegible] equals [illegible] ( )

11. Half of the sum of 6 and 12 equals — (1) 6, (2) 9, (3) 2 ( )

12. 25¢ is what part of a dollar? — (1) [illegible], (2) [illegible], (3) [illegible] ( )

13. 7% of $300 equals (1) $42[illegible], (2) $2100, (3) $21 ( )

14. The product of 7.5 times .03 equals — (1) 2.25, (2) .225, 

(3) 22.5 ( ) 

15. 30 divided by [illegible] equals — (1) 36, (2) 150, (3) 180 ( )

16. 4.2 divided by 6 equals — (1) 7, (2) .7, (3) .07 ( ) 

17. Mr. Sears paid $100 for a horse, $12 for a harness and $48 for a 

wagon. To find the total cost you should — (1) add, (2) sub- 
tract, (3) multiply, (4) divide ( ) 

18. Ruth's cyclometer read 586.7 at starting and exactly 1175
when they reached home. To find how far she rode, you should
— (1) add, (2) subtract, (3) multiply, (4) divide ( )

19. To find how many yards of ribbon are needed to make a dozen
belts each [illegible] yd. long, you should — (1) add, (2) subtract,
(3) multiply, (4) divide ( )

20. Lucy reckons that it costs her $28.75 to feed 15 hens for a year. 

To find the cost per hen for a year, you should (1) add, (2)
subtract, (3) multiply, (4) divide ( )

21. Express as a common fraction and write your answer in the 

parentheses — .75 ( ) 

22. Write the missing number in the parentheses — 12 equals 

50% of. ( ) 

23. Write the missing number in the parentheses. Traveling for 

5 hours at 18 miles per hour you go a distance of miles ( ) 

24. What is the cost for one article when you get 10 for $.25? ( ) 

25. Supply the missing number and write your answer in the
parentheses — .18 divided by .05 equals ( )

26. A girl buys 3 articles at 75c each and pays $5. What change
should she receive? Write your answer in the parentheses. ( )

27. Write the missing number in the parentheses. A man 
bought 3 drums for $1.40, $1.35, and $1.00. The average 

cost was ( ) 

28. John collects rent money for his father. He is paid .02 
times what he collects. Write in the parentheses how much 

he is paid when he collects $14.50 ( ) 

29. Lucy earned $22.50 in the summer. She spent $4.25 for 
books and $4.05 for clothes. The rest of the money she put 
in the savings bank. Find the amount she put in the savings 

bank, and write your answer in the parentheses ( ) 
30. Alice and Helen expect to pick 360 baskets of blackberries
this summer and to sell 70 per cent of them. How many
baskets do they expect to sell? Write your answer in the
parentheses ( )

LANGUAGE. 6 

1. All sentences should begin with a (1) capital, (2) small letter ( )

2. Charles (1) may, (2) can run very fast ( ) 

3. (1) May, (2) can I make a shortcake, if I can find some 

berries? ( ) 

4. (1) Can, (2) may I have your permission to go? ( ) 

5. Mother (1) may, (2) can I go to visit Mary? ( ) 

6. The apples (1) laid, (2) lay on the ground ( ) 

7. Tom (1) came, (2) come home today ( ) 

8. John (1) sit, (2) set up in your seat ( ) 

9. Please (1) set, (2) sit down ( ) 

10. The abbreviation for January is (1) Jan., (2) Janu ( ) 

11. A peck of peanuts (1) was, (2) were sold ( ) 

12. (1) Has, (2) have each one his book open? ( ) 

13. Call for Clara and (1) me, (2) I ( ) 

14. A bushel of apples (1) were, (2) was sold ( ) 

15. There (1) was, (2) were three of us on the sled ( ) 

16. Each child is in (1) his, (2) their seat ( ) 

17. I shall (1) teach, (2) learn you this trick ( ) 

18. (1) Was, (2) were you there yesterday? ( ) 

19. Will you (1) learn, (2) teach me to play? ( ) 

20. Every one can do this if he (1) tries, (2) try. ( ) 

21. The children (1) came, (2) come home tired ( ) 

22. They (1) ran, (2) run all the way. ( ) 

23. We did not know they had (1) went, (2) gone until after 

dinner ( ) 

24. May Ethel and I (1) sit, (2) set together? ( ) 

25. You may (1) sit, (2) set the brown hen ( ) 

26. The brown hen will (1) set, (2) sit on the eggs ( )

27. Set the chair over there and (1) set, (2) sit down ( ) 

28. You (1) may, (2) can refer to your books ( ) 
29. Father said I (1) could, (2) might bring Susan to school ( )

30. It was (1) they, (2) them who broke the window ( ) 

READING. 6 

1. Loki is a character in — (1) Siegfried and the Dragon, (2) The 

Argonauts, (3) Miles Standish ( ) 

2. The early explorer who said, “Sail on! Sail on! Sail on!” was —
(1) Columbus, (2) Drake, (3) the leader of the Pilgrims ( )

3. “The Argonauts” was written by (1) Macaulay, (2) Sir Walter
Scott, (3) Kingsley ( )

4. Jason was the teacher of Chiron ( ) 

5. “The Pied Piper of Hamelin” teaches — (1) It pays to keep a 
promise, (2) children like music, (3) the river is a good place 

to drown rats ( ) 

6. “The Moonlight Sonata” was composed by (1) a shoemaker,
(2) a blind girl, (3) Beethoven ( )

7. “The Inchcape Rock” teaches that — (1) It is fun to sink a bell
in the ocean, (2) we suffer for our evil deeds, (3) the Abbot
of Aberbrothok was foolish to try to warn sailors by means of
a bell ( )

8. The “Landing of the Pilgrim Fathers” was written by (1) Hunt,
(2) Hemans, (3) Miller ( )

9. “Urgan” referred to in “Alice Brand” was a giant ( )

10. Herminius is a character in — (1) Horatius, (2) The Cadi's
Decision, (3) The Moonlight Sonata ( )


11. “They sought a faith's pure shrine,” means that they had 
heard of — (1) a very beautiful altar and were looking for it, 

(2) they were looking for a place to worship as they believed 
right, (3) they were coming to America to build cathedrals ( )

12. “A Christmas Carol” was written by (1) Browning, (2) Tennyson,
(3) Dickens ( )

13. Macaulay wrote “Horatius at the Bridge” ( )

14. “The Pied Piper of Hamelin” was written by Hemans ( )

15. The Pied Piper led the children into the river ( ) 

16. Hamelin was on the river Rhine ( ) 

17. John Maynard forsook the ship and left all to perish ( ) 


18. Southey teaches in “The Battle of Blenheim,” that — (1) war
always has a glorious victory for some one, (2) the common
soldiers always gain most in a war, (3) war is uselessly cruel
and wicked ( )

19. In “Horatius at the Bridge” Horatius saved Rome by swimming
the Tiber ( )

20. A “cadi” is a (1) sheik, (2) judge, (3) beggar ( )

21. Jason’s men were saved by the sirens ( )

22. “The Professor of Signs” — (1) is just a funny story, (2) teaches
that a lowly man should keep his temper, (3) shows how
easily the same facts may be interpreted in more than one way. ( )

23. Jason sought the Golden Fleece ( )

24. King Arthur pulled the sword Excalibur from (1) his scabbard,
(2) a stone, (3) a tree ( )

25. The poem “Columbus” was written by (1) Coleridge, (2)
Miller, (3) Southey ( )

26. The story of “A Wonderful City” describes (1) a nest of ants,
(2) a city in Asia, (3) a beehive. ( )

27. The poem “April Rain” was written by (1) Mrs. Browning,
(2) Loveman, (3) Longfellow ( )

28. “The Inchcape Rock” was written by Wilcox ( )

29. Sir Ralph the Rover placed the bell on the rock ( )

30. “The Man Worth While” was written by Ella Wheeler Wilcox ( )


END OF PART I. STOP HERE. GO BACK AND SEE THAT ALL 
OF YOUR ANSWERS ARE RIGHT. 

PART II

GEOGRAPHY. 6 

1. The wind and rain help to break up the rocks to make soil ( )

2. Most of the corn grown in the North Central States is shipped
to Europe ( )

3. Wheat is the chief crop of the South Central States ( ) 

4. Cotton is raised extensively near Philadelphia ( ) 

5. Pennsylvania is a great coal-mining state ( ) 

6. Rubber is produced from the wood of the bamboo tree ( ) 

7. The region around Para, Brazil, is noted for the production of 

coffee ( ) 

8. The western part of Brazil is densely populated ( ) 

9. Most of the people of Colombia and Venezuela live on the
plateau of the Andes ( )

10. Guiana is the only part of South America that belongs to

European nations ( ) 

11. Many cattle are raised on the plateaus of the interior of Brazil. . ( ) 

12. Buenos Aires is a city that is larger than New York ( ) 

13. Quito is the capital of Peru ( ) 

14. Chocolate is made from a root that is somewhat like a potato . . ( ) 

15. South America has a very good system of paved roads and 

railroads ( ) 

16. The United Kingdom of Great Britain and Ireland builds 

more ships than any other country in the world ( ) 

17. The ruler of the British Empire is chosen in a way similar to 

the way that the President of the United States is chosen ( ) 

18. Belgium is a densely populated country ( ) 

19. Farming is a very important industry in Denmark ( ) 

20. Coal mining is very important in Norway ( ) 

HISTORY. 6 

1. The most important achievement of early man was learning to 

live in groups ( ) 

2. The building of mud and brush huts was also an important 

achievement ( ) 

3. The use of fire was a disadvantage ( ) 

4. Flint made better weapons than metal ( ) 

5. The Egyptians discovered that the year was 365 and ¼ days 

long ( ) 

6. The Hebrews built large pyramids ( ) 

7. The Hebrews gave civilization the idea of worshiping just 

one God ( ) 

8. The Phoenicians divided the year into months, weeks, days, 

hours, and minutes ( ) 

9. The people of the Tigris-Euphrates valley traveled widely and 

carried with them the learning of Egypt ( ) 

10. The Greeks gave the world a wonderful literature which has 

been read and valued by all peoples ( ) 

11. The greatest help of the Greeks was the idea of the people 

having a voice in the government ( ) 

12. The Greeks built wonderful roads through the lands that they 

conquered ( ) 
13. Most of our laws are based upon the laws of the Greeks ( ) 

14. Rome made use of the art and learning of Greece ( ) 

15. To the Teutons we owe the idea of each man’s being free to 

express his own belief ( ) 

16. Alfred the Great was a wise king and a good scholar as well . . . . ( ) 

17. Alfred the Great was King of— ( ) 

18. King John was forced to sign the— ( ) 

19. The ________ were organized to protect the Christian 

pilgrims who traveled to Jerusalem ( ) 

20. The ________ was the most powerful man in Europe 

during the Middle Ages ( ) 

ART. 6 

1. Red, blue, and yellow are primary colors ( ) 

2. Red and blue make — ( ) 

3. Blue and yellow make — ( ) 

4. Yellow and red make — ( ) 

5. Black is a (1) standard color, (2) neutral color ( ) 

6. Red and green make — ( ) 

7. Orange and blue make — ( ) 

8. Violet and yellow make — ( ) 

9. Gray is a neutral color ( ) 

10. White is a neutral color ( ) 

11. Warm colors are restful to the eyes ( ) 

12. Blue, violet, green, gray, and white are cool colors ( ) 

13. Blue, blue-green, and blue-violet are colors ( ) 

14. Orange, red-orange, and yellow-orange are colors ( ) 

15. Red, yellow, and orange are warm colors ( ) 

MUSIC. 6 

1. A sharp before a note (1) raises, (2) lowers the pitch ( ) 

2. Do and keynote are the same. ( ) 

3. Do, mi, sol in any key form what chord?. ( ) 

4. What sign placed before a note lowers its pitch? ( ) 

5. The sharp farthest to the right in the key signature 

is always what syllable? ( ) 

6. The word ritard means (1) fast, (2) gradually slower, (3) 

loud ( ) 
7. The word forte means (1) fast, (2) gradually slower, (3) 


loud ( ) 

8. The word means — ( ) 

9. Give the letter which indicates the music is to be played or 

sung softly ( ) 

10. Give the sign meaning “very loud” ( ) 

11. The flat farthest to the right in the key signature is 

always ( ) 

12. In two measures of 4-4 time there are how many quarter 

notes? ( ) 

13. In 6-8 time a quarter rest receives two beats ( ) 

14. There is a tie between two whole notes in 4-4 time; how 

many measures would you hold the same tone? ( .) 

15. In 2-4 time an eighth note followed by a dot requires 

what kind of a note to complete the beat? ( ) 

MORALS AND MANNERS. 6 

1. It often takes great moral courage to tell the truth ( ) 

2. The way boys and girls dress has much to do with making a 

good impression on other people ( ) 

3. A pupil who throws lunch scraps on the school grounds is 

helping to make his school attractive ( ) 

4. The comfort of the home depends largely on the helpfulness 

of its boys and girls ( ) 

5. Girls and boys show kindness by speaking disagreeably about 

those who are absent ( ) 


6. For boys and girls to develop into good citizens they need (1) 
merely to avoid breaking the law, (2) merely to learn their 
lessons at school, (3) to learn all they can at home and at 
school and to take part in as many as possible of the different 


forms of group life, such as the playground group, the school 
club, and the like ( ) 

7. If a young person repeatedly takes little things that do not 
belong to him, he is (1) really forming the habit of stealing, 

(2) exercising the right of every person in a free country, (3) 
doing no harm unless he gets caught ( ) 

8. If a boy is loyal to his gang, (1) he is doing something wrong, 

(2) he proves he has a splendid quality, (3) he shows a lack 

of good sportsmanship ( ) 
9. Thrift (1) applies only to putting money in a savings bank, 

(2) means going without necessary food or clothing, (3) re- 
quires careful and thoughtful use of clothes, books, time, 
money, strength, of all that one has ( ) 

10. If you should find something on the playground that is not 
yours, you should (1) keep it and say nothing about it, (2) try 

to find the owner, (3) throw it away ( ) 

11. A pupil is dependable if he behaves well (1) when the teacher 

is in the room, (2) when a visitor is present, (3) when the 
pupils are alone in the room ( ) 

12. One way in which to show good school spirit is (1) to be waste- 
ful, (2) to be disrespectful, (3) to be honest ( ) 


13. Why does an employer question a teacher about a boy’s or a 
girl’s honesty and truthfulness when he is looking for office 
help? (1) He is interested in young people, (2) he is interested 
in the schools, (3) he feels that a boy or a girl who is honest in 


school will be honest elsewhere ( ) 

14. If another person expresses an opinion different from your own, 

you should (1) make fun of him, (2) grant him the same right 
to his opinion that you have to yours, (3) insist that he 
change his opinion to conform with yours ( ) 

15. If a pupil is poorly dressed, you should (1) tell him you are sorry 
he is poor, (2) refuse to play with him, (3) pay no attention 

to his clothes ( ) 

PHYSICAL EDUCATION. 6 


In questions 1 to 6 inclusive underline the word that is not related to 
the other four, and then put the number of that word in the parentheses 


at the end of the dotted line. 

1. (1) baseball, (2) indoor, (3) bat, (4) playground, (5) 

catcher ( ) 

2. (1) somersault, (2) handspring, (3) stunt, (4) cartwheel, 

(5) game ( ) 

3. (1) bladder, (2) goal, (3) fish, (4) basketball, (5) jumping ( ) 

4. (1) sneezing, (2) fresh air, (3) dry feet, (4) sunshine, (5) 

handkerchief. ( ) 

5. (1) health, (2) sleep, (3) vegetables, (4) coffee, (5) tooth- 
brush ( ) 

6. (1) net, (2) reaching, (3) kicking, (4) batting, (5) volley- 
ball ( ) 
7. A ripe banana is a healthful food ( ) 

8. The triple-posture test is a corrective exercise ( ) 

9. The pull-up is a speed-testing event ( ) 

10. The eyes should be guarded from the direct rays of the sun ( ) 

11. Volley-ball is played on a space called a diamond ( ) 

12. Sunshine kills disease germs ( ) 

13. The most important thing about a play-day is (1) the fun of 

(2) good fellowship, (3) healthful exercise ( ) 

14. Carrousel is (1) a posture test, (2) a running game, (3) 

rhythmical activity ( ) 

15. Nine-court basketball requires (1) three goals, (2) the playing 

space divided into nine sections, (3) ten players ( ) 


THE TEST IS OVER. IF YOU HAVE TIME, GO BACK OVER 
PART II AND MAKE SURE THAT YOUR ANSWERS ARE RIGHT 

SPELLING. 6 

The teacher will dictate the spelling in sentences. 

[Note: Space follows here for writing spelling test from dictation.] 


WRITING. 6 

The teacher will score the writing, using Zaner and Bloser, Ayres, or any 
other good writing scale, allowing 1 to 25 points according to excellence. 

EXAMINATION II 

The State of Wyoming was one of the first states to 
employ extensively the objective examination, and is now 
one of the twenty-odd states administering uniform state- 
wide examinations. New Jersey has also pioneered in 
objectifying her state examinations. 

The following examination is largely the work of Miss 
Beatrice McLeod of the State Department of Education, 
State of Wyoming.¹ It should be noted that this test is 
designed to cover a range of three grades; in this respect it 
resembles the standard test more closely than it does the 
traditional examination which was almost invariably 
planned for a single grade. 

¹See also B. McLeod and H. Irving, “Objective Examinations in the Rural Schools of 
Wyoming,” Journal of Educational Research, Vol. XVIII (1928), pp. 45-49. 

Page 1— AGRICULTURE 

STATE OF WYOMING 
State Examination in Agriculture 
For Sixth, Seventh, and Eighth Grades 

Allow exactly 60 minutes. 

Name Grade Date 

School. — Age. Next Birthday. 


There are four pages of this test. As soon as you finish one page, go on to 
the next. Use all the time you have. 


TEST I 


Directions: If the statement is true, underline the word true. 

If the statement is false, underline the word false. 

Do not guess. If you are unable to decide whether the state- 
ment is true or false, let it alone. 


1. A Wyoming farmer does not need an education. 

2. We should kill as many birds as possible. 

3. It has been found that sugar beets are a profitable 
crop in Wyoming. 

4. A small flock of sheep is of no value to a Wyoming 
farmer. 

5. Chicks should be fed immediately after hatching. 

6. A Jersey cow gives very rich milk. 

7. All milk should be kept in a sanitary condition. 

8. It is not necessary to test seed corn. 

9. Diversified farming is the safest system. 

10. A good farm organization is of much value to a community. 


TRUE FALSE 
TRUE FALSE 
TRUE FALSE 
TRUE FALSE 
TRUE FALSE 
TRUE FALSE 
TRUE FALSE 
TRUE FALSE 
TRUE FALSE 
TRUE FALSE 
11. All insects are destructive and should be exterminated. 

12. A hen that is very fat will not make a good layer. 

13. A dog is of no value to a farmer. 

14. Corn and tomatoes are the principal canning vegetables. 

15. The people of Wyoming bring their scrub stock to the 
State Fair. 

16. The chicken is most important of the fowls. 

17. Bacteria never work in milk. 

18. The Merino sheep is noted for his mutton. 

19. John Burroughs is called “the friend of birds.” 

20. Guano is a valuable commercial fertilizer. 


TRUE FALSE 
TRUE FALSE 
TRUE FALSE 

TRUE FALSE 

TRUE FALSE 
TRUE FALSE 
TRUE FALSE 
TRUE FALSE 
TRUE FALSE 
TRUE FALSE 


TEST II 

Directions: In the spaces before the column at the right place the 
NUMBER of the word that corresponds to the list in the column at the 
left. 


 1. Percheron          ____ A dairy cow 
 2. Alfalfa            ____ The smallest of living things 
 3. Ayrshires          ____ A baby plant 
 4. Bacteria           ____ External parasite 
 5. Embryo             ____ Carriers of disease 
 6. Tick               ____ Beef type of cattle 
 7. Rats               ____ Leading American crop 
 8. Hereford           ____ A draft horse 
 9. Corn               ____ A breed of hogs 
10. Duroc Jersey       ____ A leguminous crop 


TEST III 

Directions: Fill in the blank with the word that makes the best answer. 

1. Plants like tomatoes and cabbages should be started in a. 

2. A plant that lives off other plants is called a 

3. The practice of plowing and cultivating a field one year, in order to 

grow a crop on it the next year, is called a 

4. The decayed animal and vegetable matter in the soil is. 
5. Which is the most widely used and most important fiber? 

6. The principal disease of hogs is 

7. The danger of this disease has been lessened by. 

8-9. The two main types of hogs are 

10. Feed that supplies ingredients in the proper proportion and amount to 

meet the needs of the animal is called a 

11. What is the greatest enemy of the cotton grower? 

12. Is the earthworm a hindrance or a help to the farmer? 

13. Who is called the “plant wizard”? 

14. What large irrigation project is found in Wyoming? 

15. Where is a large irrigation dam under construction in Wyoming? 

16. What insect pest is the most numerous in Wyoming? 

17. What country leads in the production of hogs? 

18. The most useful insect is the 

19. Rabbits, prairie dogs, and gophers are very destructive 

20. What other animal besides the cow produces milk for family use? 

TEST IV 

Directions: Underline one word which completes the sentence. 

1. Alfalfa is a kind of corn fruit hay. 

2. Bacon comes from the cow hog sheep. 

3. The tractor is used in farming mining racing. 

4. Rye is most like beans corn wheat. 

5. Beets are used for making catsup sugar jellies. 

6. Lard comes from butter cattle hogs. 

7. A tree that will grow from cuttings is the oak pine willow. 

8. The Leghorn is a kind of cow fowl goat. 

9. A crop which enriches the soil is clover potatoes tobacco. 

10. Milk testers were devised by Babcock Bell Edison. 

11. A plant that can be grafted is the apple-tree lily wheat. 

12. A good breed of dairy cows is the Holstein Durham Hereford. 

13. One of the leading crops in Wyoming is rice tobacco hay. 

14. The soil is enriched by osmosis propagation legumes. 

15. We get rid of the potato bug by spraying fumigating dipping. 
TEST V 

Directions: Put a cross (X) before the best answer. 

1. We cull our flock of hens in order that we may have: 

A better looking flock. 

A better laying flock. 

A smaller flock. 

2. Crop rotation is practiced by farmers in order to: 

Lengthen the period of fertility of the land. 

Adapt the crops to the season. 

Prevent the land from lying idle through the winter. 

3. Leguminous plants are important because: 

They grow on soil too poor to support other crops. 

They return nitrogen to the soil. 

They are easily cultivated. 

4. A Wyoming farmer cultivates his corn soon after a rain: 

To enrich the soil. 

To compress the soil firmly about the roots. 

To form a mulch and check evaporation. 

5. A farmer should protect the birds: 

Because of their beautiful plumage. 

Because of their sweet music. 

Because they destroy many harmful insects. 

TEST VI 

Directions: Write on each line the word or words which complete the 
sentences. Don’t waste too much time on one you do not know. Go on 
and come back to it later. 

1. The parts of a plant are (1) ________, (2) ________, 
(3) ________, (4) ________, (5) ________. 

2. We cultivate a field to (6) ________ 
(7) ________ 
(8) ________ 
(9) ________ 
(10) ________ 

3. The different kinds of soil are (11) (12) 

(13) 

4. Some of the leguminous crops in Wyoming are (14) , 

(15) , (16) 

5. Three important dry-farming crops are (17) 

(18) , (19) 

6. The four good “general purpose” breeds of chicken are the (20) 

(21) (22) (23) 

7. The two best-known egg-producing breeds are (24) 

(25) 

The largest meat-producing breed is the (26) 

8. Besides chickens the Wyoming farmer raises many large flocks of 

(27) (28) 

9. Weeds are harmful because they not only (29) 

but often (30) 


EXAMINATION III 

Examination III is the work of Mr. Sam Everett and Miss 
Effey Riley of East High School, Rochester, New York. 
This test was constructed in January, 1928, and was revised 
the following June, the revision being reprinted here by 
permission of the Rochester Board of Education. Note 
that this examination combines old and new types of testing. 
(See Exercise IV for “essay” question and Exercise VI for 
use of controlled “short-answer” or association technique.) 

EAST HIGH SCHOOL 


AMERICAN HISTORY 
Term 

(HISTORY III-1) 

June, 1928 

General Directions: 

This examination has been carefully planned in order to test the different 
kinds of abilities in which we have tried to give you some training. Do 
not hurry. It is far more important to try and answer each exercise carefully 
than to try to finish. Answer the questions in order. Directions are 
given with each exercise. 
Exercise I. To test both your knowledge of the relationship of a 
number of great Americans to certain historical periods and of their signif- 
icance within their own period. 


A. From the list at the right of the page, select the names of five people 
in each of the following periods. Write their names under the name 
of the period in which they were prominent. Note that some names 
may be used more than once and that there are some names which 
may not be used at all. 


1. The Colonial Period (1607-1763) 


2. Men who helped to form the U. S. Constitu- 
tion and favored its acceptance by the States. 


3. Leading men in sympathy with the Feder- 
alist Party and with Federalist ideals. 


4. Leading men in the early Republican Party. 


George Washington 
Roger Williams 
John Cabot 
John Calhoun 
John Adams 
Governor Clinton 
Patrick Henry 
Ann Hutchinson 
Alexander Hamilton 
Robert Morris 
Peter Stuyvesant 
Henry Hudson 
John Marshall 
John Jay 
Henry Clay 
James Madison 
William Penn 
James Monroe 
Thomas Jefferson 
John Winthrop 
Andrew Jackson 
B. Indicate after each name two significant historical facts which come to 
your mind in connection with the man in question. 

1. Robert Morris. 

2. William Penn 

3. Alexander Hamilton. 

4. John Marshall 

5. Thomas Jefferson 

6. Roger Williams. 


Exercise II. To test your knowledge of important geographical facts 
of colonial times that have affected the development of American civili- 
zation. 

Place the number of the best answer on the line provided at the right 
of each statement. 

A. The great natural gateway from the Atlantic coast into the West 
that Americans knew must be held if the revolution of the 13 

colonies was to be permanently successful was 

(1) Chesapeake Bay, (2) Connecticut Valley, (3) Hudson 
Valley, (4) James River, (5) St. Lawrence River. 

B. The rise of manufacturing in New England was greatly aided by 

the fact that their physical environment furnished 

(1) cold temperature, (2) all kinds of raw materials, (3) many 
navigable rivers, (4) easy communication with the West, (5) 
water power. 

C. The bulk of trade of the Colonial South was with 

(1) New England, (2) England, (3) West Indies, (4) Spain, 

(5) Different Southern Colonies. 

D. The most important trade of the New England colonies, before 

the American Revolution, was with 

(1) South America, (2) West Indies, (3) Spain, (4) Far 
West, (5) England. 
E. The principal crop of Virginia in colonial times was 

(1) wheat, (2) tobacco, (3) potatoes, (4) cotton, (5) furs. 

F. The industry that brought to Colonial Massachusetts the great- 

est prosperity was 

(1) potatoes, (2) corn, (3) hemp, (4) fish, (5) hats. 

G. The wealth of Colonial South Carolina came chiefly from 

(1) rice, (2) tobacco, (3) cotton, (4) furs, (5) wheat. 

H. The character of soil, climate, and location allows the develop- 
ment of a great variety of agricultural products in 

(1) New England, (2) Middle States, (3) Virginia, (4) 
Appalachian Mt. Region, (5) Georgia. 

Exercise III. To test your ability to recognize clearly conditions of 
social environment of different periods of American history. 

Each of the statements below can be completed by one of the five 
different numbered phrases. Read each statement. Decide which of 
the numbered phrases, when added to the original statement, will make it 
true and complete. Then place the number of the completing phrase on 
the dotted line at the right of the statement. 

A. During the colonial period democracy was 

(1) common in all the colonies, (2) confined to Rhode Island, 

(3) everywhere limited by property qualifications, (4) stamped 
out by the Crown, (5) found only in Massachusetts as a result 
of the Mayflower Compact. 

B. In colonial times voting was the privilege of 

(1) all the inhabitants, (2) all male whites over 21 years, 

(3) those born or naturalized in the U. S., (4) all persons who 
could read or write, (5) those who possessed either religious 
or property qualifications, or both. 

C. In the Middle Colonies the dominant character developed was 

(1) Yankee, (2) Puritan, (3) Planter, (4) Poor white, 

(5) Quaker. 

D. The character of the Puritan may best be described as 

(1) delightful, easy-going, aristocratic, (2) friendly, equality- 
loving, thrifty, peaceful, (3) strict, thrifty, deeply religious, 
intolerant, conscientious, (4) rough, self-reliant, adventure- 
some, (5) shiftless, poor, ignorant. 
E. The Middle Colonies were peopled mainly by 

(1) English, Scotch-Irish, and Welsh, (2) French, Germans, 
English, (3) French, English, (4) English only, (5) Germans, 
English, Swedes, Dutch. 

F. The Appalachian chain of mountains was chiefly significant in 

our early colonial history because 

(1) it sheltered Indian marauders, (2) it hindered the westward 
advance of the colonists, (3) its fine timber was used for ship 
building, (4) it contained rich iron and coal deposits, (5) it 
contained rich grazing lands. 

G. “The good education of children is of singular benefit to any 

community” was first declared an American ideal by 

(1) Virginia, (2) New York, (3) Pennsylvania, (4) Massa- 
chusetts, (5) Georgia. 

H. The American ideal of equality was a most nearly realized fact 

(1) in New England, (2) with the Dutch in N. Y., (3) on 
Southern plantations, (4) on Western frontier, (5) in the 
Quaker settlements of Eastern colonies. 

I. Religious toleration in colonial times was found in 

(1) Virginia, (2) Massachusetts, (3) New Hampshire, (4) 
Maryland, (5) Georgia. 

J. At the time of the adoption of the Constitution the states were 

(1) all in favor of the new constitution, (2) anxious for separate 
constitutions, (3) in favor of it except the New England colonies, 
(4) within two years mostly in favor of it, (5) never 

in final agreement concerning it. 

K. By the time of Jefferson’s first inaugural, voting privileges in 

most of the States were 

(1) extended to all, (2) restricted to manhood suffrage, (3) 
still limited by property and religious restrictions, (4) limited 
only by religious restrictions, (5) given to many Indians who 
had become naturalized. 

L. The chief significance of the capture of the government by the 

Jacksonian Democrats as we see it today was that 

(1) it brought an end to “The Era of Good Feeling,” (2) it 

gave western politicians control of its government, (3) the 
National Bank was abolished, (4) Jackson was a great Indian 
fighter, (5) many democratic institutions such as white man- 
hood suffrage came to be established. 
Exercise IV. To test your ability to see how an intelligent knowledge 
of past events helps us to understand present-day situations, and tendencies. 
(Note: Write your answer in essay form on a separate sheet of paper.) 
Some one has said that we study the past relationships in American life 
in order to be able to understand the present in our civilization and that 
we need to understand the present so as to influence American national 
development toward finer things. 

State your reasons for every position assumed, 

a. Take some economic fact or group of facts in American History about 
which we have studied and briefly show what seems to you to be the 
actual significance of this fact in the past, present and future of America. 

b. Show this same three-fold relationship using some political fact or facts. 

c. Show this same three-fold relationship using a religious fact or facts. 

Exercise V. To test your ability to recognize some of the precious 
social heritages that have come down to our present-day America from 
the past. 

A. What do we mean by “social heritage”? 

[Editor's Note: In the original, sufficient space was allowed here for writing the 
response to question A.] 

B. Below is a list of statements. Indicate by a cross (X) after it each 
statement that expresses a social heritage of the present-day American 
nation. 

Place a (0) after each statement that is not a present-day social heritage 
of the American nation. 

1. Americans believe in the ideal of religious toleration. 

2. Property in land should be inherited by a man's eldest son 

3. Citizens should have the right to say what taxes should be put 

upon them. 

4. No man's house shall be searched for evidence of law violation 

unless the searchers have a written permit stating exactly what 
they are searching for, and who made the charge of law vio- 
lation. 

5. The majority of citizens shall always have the right to state 

what shall be the religious faith practiced in the community 

6. Government should interfere with men's lives and freedom as 

little as possible. 

7. An ideal of society is a belief in the union of church and state. 
8. States may have many sovereign powers, and still be obedient 

to some higher central authority in some common matters that 
affect several states alike. 

9. Government shall be separated into three powers: executive, 

legislative, judicial. 

10. An aristocratic class is necessary in order that a nation shall 

have fine, intelligent, moral and cultural leadership. 

Exercise VI. To test your ability to recognize and judge the signifi- 
cance of certain famous events in our national history. 

Indicate by a descriptive sentence your exact historical knowledge of 
each of the following, and its significance. 

[Editor's Note: In the original, space for the response was allowed after each item.] 

1. The introduction of tobacco culture 

2. The founding of Pennsylvania 

3. The invention of the cotton gin 

4. The Philadelphia Convention 

5. The American Bill of Rights 

6. The Lewis and Clark Expedition 

7. Jefferson’s first inaugural 

8. Jackson’s first administration 

Exercise VII. To test your ability to recognize political theories and 
opinions of prominent party leaders. 

The following is a list of theories held either by Alexander Hamilton or 
Thomas Jefferson. Each theory is numbered. Place the number of each 
theory either after the name of Jefferson or after the name of Hamilton 
depending on whether you feel it was believed by the one or the other. 
Note that certain theories may be held by both men. 

Hamilton. Jefferson 

1. There should be a strong central government in the United States. 

2. The national government should pass laws which should chiefly benefit 
the poor. 

3. There was no need in the Constitution of guaranteeing certain “inalienable 
rights” to the people. 

4. Poor people are likable and can be trusted. 

5. The national government should spend money on internal improvements, 
especially on roads into the west. 
6. The United States should become an industrial nation. 

7. The states should not give up many of their rights in order to strengthen 
the national government. 

8. The national government's credit and position would be best strength- 
ened by decreasing the national debt. 

9. The United States should become an agricultural nation. 

10. “Your people is a great beast.” 

11. The French Revolution should be distrusted and condemned. 

12. The national government should pass laws which should principally 
benefit the rich. 

13. The frontier sections of the country are distrusted, and there is refusal 
to aid these people with national government funds. 

14. It would strengthen the national government to assume certain former 
national and state debts. 

15. It would be best for the United States government in the long run not 
to pay any bribes in order to protect our commerce. 

16. The French Revolution was a splendid movement and made for the 
betterment of mankind. 

17. The Bill of Rights should be strictly enforced. 

Exercise VIII. To test your ability to see and explain different ways 
in which environment and people may affect each other. 

(Note: Write your answer on a separate sheet of paper.) 

Most people think that the immediate environment in which we live 
largely determines the ideas of the majority of us, and that that environ- 
ment, so far as each of us is concerned, is apt to be quite accidental. 

a. Do you think this statement is true as it applies to ordinary people in 
such distinct periods of our country's history as Puritan New England, 
Colonial Pennsylvania, or our own generation? Give illustrations or 
factual evidence on which you base your opinion. 

b. Is it possible for individuals to have any moral or intellectual standards 
independent of their immediate environment? Discuss this, basing 
your opinion on facts taken from the lives of such men as Roger Wil- 
liams, Benjamin Franklin, Thomas Jefferson, or any other men that 
you care to take. 

c. Is the attempt at control or change of one's environment important or 
not for the life of each of us? On what factual evidence do you base 
this judgment? 

d. If it could be attained, what do you think would make the best kind of 
social environment for the lasting success and happiness of each of us? 
Would this apply equally to America as a whole? 
Exercise IX. To test the opinions you have formed on certain facts 
in American history. 

Below are listed various statements about early American history. Draw 
a circle around the letter or question mark which best indicates the way 
you feel about each statement, as follows: 

If you have a feeling in favor of the statement, draw a circle around R. 

If you have a feeling against the statement, draw a circle around W. 

If you are quite uncertain as to knowledge or feeling, draw a circle 
around the ?. 

Mark every item. Omit none. If you do not understand any item, 
simply put an X before the item. 

R ? W 1. The Puritans and Pilgrims came to America because of 
their suffering from religious persecution at home. They, 
therefore, determined to make religious toleration a corner- 
stone of their religious beliefs and of their government in 
the new world. 

R ? W 2. American colonists in general early welcomed and treated 
as equals all oppressed peoples, including Jews and 
Catholics. 

R ? W 3. None of the colonies in colonial times believed in and 
practiced religious toleration. 

R ? W 4. The German settlers coming to our country in colonial 
times made the poorest settlers because they were both 
ignorant and militaristic. 

R ? W 5. White men, as well as black men, were enslaved in early 
America. 

R ? W 6. Slavery never flourished in the Northern colonies because 
the institution of slavery was against the Christian ideals 
of the Puritans. 

R ? W 7. Colonial life in America after the first twenty-five years of 
settlement was pleasant and easy for the majority of the 
colonists. 

R ? W 8. The Quakers were very unpopular with the Puritans of the 
New England colonies because of their religious beliefs. 

R ? W 9. Colonial trade and industry became so large that it excited 
the fears and jealousies of English competitors. 

R ? W 10. The Pilgrim Fathers were kind to everyone no matter 
what their religious beliefs. 
R ? W 11. Among the free inhabitants of America until after the 
Revolution, social life was upon a basis of almost absolute 
equality. 

R ? W 12. Almost without exception, the people coming to America 
in colonial times represented the very best stock (physical, 
mental, social) of the countries from which they came. 

R ? W 13. Before the Revolution the rough life of the colonies had 
prevented them from making any notable contributions to 
science or other branches of learning. 

R ? W 14. At least one-half of the immigrants in America before the 
Revolution were slaves or bond servants. 

R ? W 15. The Indians with whom the colonies carried on warfare 
were extremely ferocious, and practically always their 
attacks on the settlements were sudden and unjustified. 

R ? W 16. The Quakers of Pennsylvania had no colonial militia with 
which to overawe the Indians; and yet, an Indian uprising 
against them was comparatively unknown. 

R ? W 17. Rhode Island, founded by Roger Williams on the ideal of 
religious and political freedom, was from the first one of 
our most orderly and successful colonies. 

R ? W 18. The majority of the settlers coming to America were 
Puritans, and for this reason their ideals became American 
ideals. 

R ? W 19. Since a great number of the colonists had come to America 
for political freedom and to found governments on democratic 
ideals, full manhood suffrage was granted in every 
colony from the first. 

R ? W 20. The major reason why slavery did not flourish in the New 
England colonies was because it was not a good financial 
proposition. 

R ? W 21. Before the Revolution there had grown up dissension and 
strong feelings of antagonism between certain sections of 
the frontier and the Atlantic coast communities. 

R ? W 22. The democratic town meeting was the typical type (almost 
universal) of local government in the thirteen colonies. 

R ? W 23. The American colonies before the Revolution presented a 
“strange mingling of the uncouth, the totally wild, and the 
highly civilized and cultured.” 

R ? W 24. An efficient and powerful national government was set up 
under the Articles of Confederation. 
R ? W 25. The National Government under the Articles of Confederation 
could do nothing to suppress popular disorders 
and rebellions. 

R ? W 26. The men who wrote the Constitution drew it up in secret 
session where the public could know nothing of what 
was going on. 

R ? W 27. The delegates to the Constitutional Convention at Philadelphia 
with a few unimportant exceptions tried their 
best to establish the new government on the broadest 
democratic basis possible. 

R ? W 28. The delegates to the Constitutional Convention at Philadelphia 
were more concerned in having clauses which 
protected private property than those protecting individual 
freedom. 

R ? W 29. Many of the members at this Convention had at one time 
been leaders in rebellion against law and authority. 

R ? W 30. Alexander Hamilton as a member of the Convention wholeheartedly 
submitted a plan of government that proposed a 
monarchy, with Washington as first king. 

R ? W 31. The Constitution represents a series of compromises rather 
than a document considered perfect by its signers. 

R ? W 32. Alexander Hamilton said of the Constitution that it was a 
“flimsy” document, that “would not last a year.” 

R ? W 33. Some members of our Constitutional Convention were 
influenced by their private or class interests in drawing up 
certain parts of the Constitution. 

R ? W 34. While the Constitution was being drawn up sectional 
jealousies frequently divided the delegates. 

R ? W 35. After the drawing up of the Constitution it was ratified by 
the various states with little or no opposition. 

R ? W 36. The supremacy of the U. S. Constitution as the ultimate 
authority over all the people has never been seriously 
questioned since its adoption. 

R ? W 37. Intelligent people who study the Constitution in detail, 
as it works out in practice, feel that the founders gave to 
our people a nearly perfect document that will never need 
much change. 

R ? W 38. The Constitutional Convention was peaceful and encountered 
no difference in issues. 
R ? W 39. Our Constitution represents a new idea in government, 
created by the thought of the brilliant men of genius who 
formed our Constitutional Convention. 

R ? W 40. Freedom of speech, of the press, of assemblage, of religion 
were guaranteed in the Constitution at the time it was 
originally drawn and first presented to the people for 
ratification. 

Exercise X. To test whether or not the study of American History 
has meant to you mere “book-learning.” 

(Note: Write your answer on a separate sheet of paper.) 

A. Carefully explain the meaning of “book-learning” as it is used in the 
following quotation: 

“Education is not book-learning. It has to do with insight, with 
valuing, with understanding, and with the development of the ability 
to make a choice among the possibilities of experience.” 

B. As part of your education you have been studying in American history 
about the Constitutional Convention. Has the study of that historical 
event meant to you simply memorizing a list of facts or events, — or 
has it given you (1) insight into the significance of certain decisions 
made by the men of the Constitutional Convention; (2) ability to 
evaluate certain clauses of our Constitution; (3) ability to decide 
whether our forefathers intended to give us a democracy, or not? 

If you have gained any of these three things, will you try to show that 
you have acquired them through use of practical illustrations in each of 
the three cases? 

Exercise XI. To test your ability to reason clearly using historical 
facts and truths as a basis. 

(Note: Write your answer on a separate sheet of paper.) 

A. From your study of American History illustrate the probable truth of 
the following statements by comparing several earlier periods with 
our own. 

“In this very uncertain world of ours, ways of living, standards of 
value, customs, and traditions are always in process of continual 
change.” 

B. Recognizing this truth, what do you think that an education should 
do for each one of us? Should we be taught what to think, or how to 
think? Use a number of illustrations in thinking through this problem. 
Make sure, in your answer, that you show that you understand the 
difference in meaning between the two phrases, “what to think” and 
“how to think.” 
EXAMINATION IV 

Examination IV is one of a series of objective tests pro- 
duced in connection with the Summer Library Institute of 
the American Library Association held at the University of 
Chicago during the summer of 1926. The complete series 
includes tests on the following library-school subjects: 
Book Selection, Reference Work, Library Classification, 
Lending Methods, School-Library Administration, Children’s 
Work, and How to Use the Library. The last mentioned is 
given here, with omissions, as being the one of greatest 
interest to teachers other than librarians. It is the work of 
Miss Linda M. Clatworthy, Miss Sadie T. Kent, Miss Anna 
C. Lagergren, and Miss Delia V. Ovitz. General supervision 
of the construction of these tests was given by the instructors 
in the Summer Library School, particularly Professor Sidney 
B. Mitchell and the present author. 

Although these tests are not generally available, informa- 
tion concerning a limited mimeographed edition of the same 
may be had by addressing the American Library Association, 
86 East Randolph Street, Chicago, Illinois. 

OBJECTIVE EXAMINATION 
Undergraduate Course: “How to Use the Library” 
True-False 

1. The Standard Dictionary, in defining a word, gives the 
literal or original meaning first.  TRUE  FALSE 

2. The New International Dictionary gives the common 
meaning first.  TRUE  FALSE 

3. The definitions in the New International Dictionary are 
fuller than the definitions in the Century Dictionary.  TRUE  FALSE 

4. The system for showing the pronunciation is the same 
in all the dictionaries.  TRUE  FALSE 

5. The etymologies are fullest in the Century and briefest 
in the new Standard.  TRUE  FALSE 

6. For abbreviations and foreign words and phrases, the 
New International or the New Standard is better than 
the Century.  TRUE  FALSE 
7. The New International Dictionary gives antonyms.  TRUE  FALSE 

8. All three dictionaries give synonyms.  TRUE  FALSE 

9. For proper names, the fullest treatment is given in the 
Century.  TRUE  FALSE 

10. The pages of the New Standard are divided into two 
sections; the words not in common use are put in finer 
print in the lower section of the page.  TRUE  FALSE 

11. Webster's Dictionary gives the fullest account of the 
history of a word.  TRUE  FALSE 

12. Oxford Dictionary is another name for Murray's New 
English Dictionary.  TRUE  FALSE 

13. The two supplementary volumes of the Century Dictionary 
published in 1909 were incorporated in the 1911 
edition in alphabetic order.  TRUE  FALSE 

14. The information on any subject is so scattered in the 
New International Encyclopedia that in order to be sure 
you have it all, you must consult the index.  TRUE  FALSE 

15. The Britannica has very full cross-references.  TRUE  FALSE 

16. The Americana is stronger in scientific and technical 
material than the New International.  TRUE  FALSE 

17. You will find the best treatment of the American Revolution 
from the American viewpoint in the Britannica.  TRUE  FALSE 

18. The material in Nelson's Encyclopedia is kept up to date 
by a Yearbook.  TRUE  FALSE 

19. The alphabetical arrangement in the New International 
is letter by letter instead of word by word; for instance, 
New Jersey, Newspaper, New York.  TRUE  FALSE 

20. The Americana Encyclopedia and the New International 
cover much the same ground.  TRUE  FALSE 

21. The Americana is the only loose-leaf encyclopedia.  TRUE  FALSE 

22. All encyclopedias follow the same scheme of arrangement 
of their subject-material.  TRUE  FALSE 

23. The articles printed in the New International Encyclopedia 
are all signed.  TRUE  FALSE 

24. The Encyclopedia Britannica is the most concise and 
comprehensive of all the encyclopedias.  TRUE  FALSE 

25. The index to the main part of the New International 
Encyclopedia is in Vol. 24.  TRUE  FALSE 

26. The Harvard Classics are published in ten volumes.  TRUE  FALSE 

27. The Harvard Classics are often spoken of as Eliot's 
Five-foot Book Shelf.  TRUE  FALSE 
28. Whitaker's Almanac contains tables and lists, chiefly 
applicable to Great Britain; statistics and information 
about the government of all countries.  TRUE  FALSE 

29. You can find the duties of a department of the U. S. 
government in the U. S. Congressional Directory.  TRUE  FALSE 

30. A biography of Abraham Lincoln will be found in Who's 
Who in America.  TRUE  FALSE 

31. The material in the Bartlett's Familiar Quotations is 
arranged alphabetically by subject.  TRUE  FALSE 

32. In Hoyt’s Cyclopedia of Poetical Quotations all quotations 
from a given author are in one place.  TRUE  FALSE 

33. The Publisher's Weekly gives you the author, title, 
publisher, and price of books published during the week.  TRUE  FALSE 

34. If you wish a list of the important books of the year 
published in America on any subject consult the Cumulative 
Book Index for that year.  TRUE  FALSE 

35. The Book Review Digest has a subject and title index.  TRUE  FALSE 

36. The World Almanac is published bi-annually.  TRUE  FALSE 

37. The Warner Library consists of extracts from the literature 
of all countries.  TRUE  FALSE 

38. The Book Review Digest is kept up to date by a monthly 
supplement.  TRUE  FALSE 

39. The Statesman's Yearbook is devoted entirely to descriptions 
and statistics of the governments, industries, 
and resources of the U. S.  TRUE  FALSE 

40. The Statesman's Yearbook is made up of long signed 
articles by specialists.  TRUE  FALSE 

41. Poole's Index is a guide to magazine articles since 1900.  TRUE  FALSE 

42. The Reader's Guide covers the 19th century.  TRUE  FALSE 

43. The Reader's Guide is published monthly.  TRUE  FALSE 

44. The Reader's Guide indexes all of the important magazines 
published in the U. S. and foreign countries.  TRUE  FALSE 

45. Inclusive pages are given for the articles in Poole's 
Index.  TRUE  FALSE 

46. If the library does not have a bound volume of Robert 
Frost’s poems, such poems as have appeared in magazines 
the past few years may be located through Poole's 
Index.  TRUE  FALSE 

47. A contemporary review of Scott’s Lady of the Lake may 
be located through the Reader's Guide.  TRUE  FALSE 

48. Good debate material on the commission form of 
government may be secured from the Reader's Guide.  TRUE  FALSE 
49. The “Library of Congress” scheme is the scheme of 
classification most frequently used by libraries.  TRUE  FALSE 

50. The Dewey scheme of classification was devised by 
John Dewey, the psychologist.  TRUE  FALSE 


108. Full bibliographic information about books and articles 
referred to in the text can usually be found in footnotes 
or bibliography at end of chapter or book.  TRUE  FALSE 

109. A rather full outline of a book may sometimes be 
found in the table of contents.  TRUE  FALSE 

110. Reading the preface of a book sometimes helps get the 
purpose for which the book can be used.  TRUE  FALSE 

111. “Q” indicates book is larger than “F.”  TRUE  FALSE 

112. In a dictionary catalog the subjects are usually in red.  TRUE  FALSE 

Multiple-Response 

113. The Articles of Confederation can be found in 

1. New International Yearbook 

2. Statesman's Yearbook 

3. Harper's Encyclopedia of U. S. History 

4. Hart and McLaughlin — Cyclopedia of American Government 

5. World's Almanac 

114. Sketches of living Americans can be found in 

1. National Dictionary of Biography 

2. Lippincott’s Biographical Dictionary 

3. Who’s Who in America 

4. Appleton’s Cyclopedia of American Biography 

5. Century Cyclopedia of Names 

115. The origin of famous names in fiction may be found in 

1. Baker’s Guide to Best Fiction 

2. Brewer’s Readers Handbook 

3. Statesman's Yearbook 

4. Readers’ Guide 

5. World Almanac 

116. The author and source of poetry and recitations may be found in 

1. Firkins — Index to Short Stories 

2. Granger — Index to Poetry and Recitations 

3. Book Review Digest 

4. Ward — English Poets 

5. A. L. A. — Index to General Literature 
117. The leading articles in all the important periodicals are indexed in the 

1. Cumulative Book Index 

2. Book Review Digest 

3. Reader’s Guide to Periodical Literature 

4. Card Catalog 

5. Gurrance — Guide to Periodicals 

118. The Statistical Abstract of the U. S. contains 

1. Statistical tables of the last census 

2. Abstract of deeds 

3. Current statistics 

4. Statistics about government 

5. Statistical maps of the last census 

119. In which book would you look to find a review of Willa Cather's 
The Lost Lady? 

1. Poole’s Index 

2. Book Review Digest 

3. Cumulative Index 

4. A. L. A. Book List 

5. Reader’s Guide 

. . . . . 

157. If you wish to find a poem and can remember the first line, what 
reference book would you consult? 

1. Hoyt’s Cyclopedia of Quotations 

2. Granger’s Index to Poetry and Recitation 

3. Carman— World’s Best Poetry 

4. Dana— Household Book of Poetry 

5. Quiller-Couch — Oxford Book of English Verse 

158. For pronunciation of places, where would you look? 

1. Rand McNally — Commercial Atlas 

2. Lippincott’s Gazetteer 

3. Century Atlas 

4. World Almanac 

5. New International Encyclopedia 

159. For concise articles on English history consult 

1. Ploetz— Epitome of Universal History 

2. Low and Pulling — Dictionary of English History 

3. Larned — History for Ready Reference 

4. Brewer — Historic Notebook 

5. Heilprin— Historical Reference Book 
160. Where would you find briefs and reports of intercollegiate debates on 
present-day questions? 

1. World Almanac 

2. Congressional Directory 

3. University Debaters' Annual 

4. Wilson Debaters’ Handbook Series 

5. Matson— Reference for Literary Workers 

161. Popularly written, yet scientifically authentic articles on any phase 
of agriculture can be found in 

1. Bailey — Cyclopedia of Agriculture 

2. Yearbook of Agriculture 

3. Statesman’s Yearbook 

4. American Yearbook 

5. World Almanac 

162. Where may you find authentic information on topics of current 
educational interest? 

1. New International Yearbook 

2. Statesman’s Yearbook 

3. World Almanac 

4. Book Review Digest 

5. U. S. Bureau of Education Bulletins 

Matching Exercises 

Classification: Dewey Decimal 

163. 
        No.    Classification       Ans. 
  1.    100    General Works        (    ) 
  2.    200    Sociology            (    ) 
  3.    300    Religion             (    ) 
  4.    400    Philosophy           (    ) 
  5.    500    Natural Science      (    ) 
  6.    600    Fine Arts            (    ) 
  7.    700    Philology            (    ) 
  8.    800    Useful Arts          (    ) 
  9.    900    Literature           (    ) 
 10.    000    History              (    ) 

164. 
  1.    370    Botany               (    ) 
  2.    822    French grammar       (    ) 
  3.    780    American history     (    ) 
  4.    811    Economics            (    ) 
  5.    150    Zoology              (    ) 
  6.    840    Psychology           (    ) 
  7.    580    Geology              (    ) 
  8.    973    Agriculture          (    ) 
  9.    330    Home economics       (    ) 
 10.    640    Printing             (    ) 
 11.    750    American poetry      (    ) 
 12.    590    Music                (    ) 
 13.    445    English drama        (    ) 
 14.    630    French literature    (    ) 
 15.    550    Education            (    ) 


Recall Questions 

165. Consult the catalog for author entry of the Proceedings 
of the American Library Association under 

166. Consult catalog for the author entry of the Report of 
the Mass. Dept. of Education under 

167. Consult catalog for author entry of our Federal 
Bureau of American Ethnology under 

168. Consult catalog for the Confessions of St. Augustine 
under 

169. The daily record of the debates and business of 
Congress is called the 

170. The chief book form of ancient Babylonia was the 

171. The chief book form of ancient Egypt was the 

172. The early book form of ancient Greece and Rome 
was the 

173. “Il.” in the catalog means 

174. “Por.” in the catalog and periodical indexes means 

175. “Pl.” in the catalog means 

176. “Enl. ed.” in the catalog means 

177. “Rev. ed.” in the catalog means 

178. “5 V.” in the catalog means 

179. Write out this abbreviation: 67th Cong. 4th sess. 
H. Doc. 323 

180. The leading index to newspapers is the 

181. The symbols found in the Reader's Guide, such as 
Lit Dig. 47:25-6 Je 15 '24, mean 

182. Ibid means 

183. “c1926” means 

184. “q.v.” means 

185. “do.” means 

186. The modern successor to Poole's Index is 

187. In the field of foreign periodicals consult the 

188. In the field of technology periodicals consult 

189. In the field of agriculture and home economics 
periodicals consult 

190. In the field of business periodicals consult 

191. The Reader's Guide began indexing periodicals in 

192. Good check-lists of periodicals may be found in 
front of 

193. In the Reader's Guide a single poem (author not recalled) 
may be found under 

194. In the Reader's Guide short stories (author not recalled) 
may be located under 

195. Book-review sections of magazines are indexed in 
the magazine indexes up to 

196. After above date, book reviews may be found in 


EXAMINATION V 

Examination V shows a most interesting and original 
approach to a very difficult type of measurement, viz., the 
appreciation of the qualities of literature. This test was 
constructed by Mr. Arthur Agard of the Alameda, California, 
High School. This examination won first prize in a 
competition with nearly one hundred objective tests in 
English. It also tied for first place among 375 entries 
representing eight principal groups of high-school subjects.¹ 

Mr. Agard’s work represents a continuation of the line of 
development in the testing of literary qualities begun by 
Dr. M. F. Carpenter of the State University of Iowa.² 

In the opinion of the author, the work of Agard and 
Carpenter represents one of the significant contributions to 
the technique of the objective measurement of the teaching 
of literature in the high school and college. 

¹See G. M. Ruch and G. A. Rice, Specimen Objective Examinations. 

²Improvement of the Written Examination, pp. 86-90. 
OBJECTIVE TEST 

ON THE QUALITIES OF A PASSAGE IN LITERATURE 

Test I: Identification Test for One Quality — 20 Points 

Each of the following passages is especially noteworthy for some one of 
the following qualities: 

A. Skillful phrasing (compactness, the exactly right word, wording 
could not be changed without weakening) 

B. Adaptation of sound to meaning (mimetic words; appropriate 
rhythms; use of mutes, gutturals, aspirates, liquids, long vowels to 
produce desired effects) 

C. Beauty of image (attractive because of desirable emotion, memory, 
or imagination appeal) 

D. Force of image (definiteness, unusualness, vividness, striking to 
imagination, many points of likeness in figures) 

E. Worth of thought 

Place the letter for the quality at the left of the first line of the passage. 
The student is advised, in case he recognizes the source of the quotation, 
to consider the selection here given only. 

The student is advised to test each passage for each quality, and by 
the Method of Residues¹ to eliminate all possibilities save the one quality 
finally determined. 

1. As rivers of waters in a dry place. 

As the shadow of a great rock in a weary land. 

2. As for the grass, it grew as scant as hair in leprosy. 

3. And now the sun has stretched out all the hills. 

4. And ten low words oft creep in one dull line. 

5. His honor rooted in dishonor stood. 

And faith, unfaithful, kept him falsely true. 

6. Battle's magnificently stern array. 

7. Let us sit, while my mind remembers 

The beauty of fire in the beauty of embers. 

8. The old order changeth, yielding place to new 

And God fulfills himself in many ways. 

9. Million-footed Manhattan, unpent, descends to her pavements. 

10. Or else, as if the world were wholly fair. 

But that these eyes of men are dense and dim 
And have not power to see it as it is — 

Perchance because we see not to the close. 


¹The “Method of Residues” refers to successive eliminations. 
11. Her eyes were deeper than the depth 

Of water stilled at even — 

12. The cataracts blow their trumpets from the steep. 

13. It's coming yet for a' that 

That man to man, the world o'er, 

Shall brothers be, for a' that. 

14. A chuckle of laughter like the tapping of unstrung kettledrums 

15. Magic casements opening on the foam 

Of perilous seas, in faery lands forlorn. 

16. The knight's bones are dust. 

And his good sword rust; 

His soul is with the saints, I trust 

17. Little flower,— but if I could understand 

What you are, root and all, and all in all 
I should know what God and man is. 

18. Without a word of warning, there 

In the autumn sky Mount Fuji stands. 

19. A solitary shriek, a bubbling cry 

Of some strong swimmer in his agony. 

20. He rushed into the field, and foremost fighting fell. 

Test II: Identification Test for Two Qualities — 20 Points 

Each of the following passages is especially noteworthy for two of the 
following qualities: 

A. Skillful phrasing (compactness, the exactly right word, wording 
could not be changed without weakening) 

B. Adaptation of sound to meaning (mimetic words; appropriate rhythms; 
use of mutes, gutturals, aspirates, liquids, long vowels to produce 
desired effects) 

C. Beauty of image (attractive because of desirable emotion, memory, 
or imagination appeal) 

D. Force of image (definiteness, unusualness, vividness, striking to 
imagination, many points of likeness in figures) 

E. Worth of thought 

F. Effective contrast of main ideas 

Place the letters for the two qualities at the left of the first line of the 
passage. 

The student is advised, in case he recognizes the source of the quota- 
tion, to consider the selection here given only. 
The student is advised to test each passage for each quality, and by 
the Method of Residues to eliminate all possibilities save the two qualities 
finally determined. 

1. The Rank is but the guinea’s stamp. 

The man’s the gold for a’ that 

2. Though I speak with the tongues of men and of angels and 
have not charity, I am become as sounding brass or a tinkling 
cymbal. 

3. The league-long roller thundering on the reef. 

4. Dirty British coaster, with a salt-caked smokestack. 

Butting through the Channel in the mad March days. 

5. In every adversity of fortune, to have been happy is the 
unhappy kind of misfortune. 

6. As a white candle 

In a holy place. 

So is the beauty 
Of an aged face. 

7. The moan of doves in immemorial elms 

And murmuring of innumerable bees. 

8. Saw a gloomy-gladed hollow slowly sink 

To westward— in the deeps whereof a mere 
Round as the red eye of an eagle owl 
Under the half-dead sunset glared. 

9. The chill 

November dawn, and dewy glooming downs 
The gentle showers, the smell of dying leaves. 

And the low moan of leaden-colored seas. 

10. A good book is the precious life blood of a master spirit, em- 
balmed and treasured on purpose to a life beyond. 

Test III: Identification Test for Errors in Phrasing of Thought 

Among the many forms of errors in logical or in tasteful expression of 
thought are: 

A. Anti-climax E. Tautology 

B. Mixed metaphor F. Redundancy 

C. Inappropriate figure G. Over-elaboration 

D. Ponderous diction 

Place the letter for the error at the left of the first line of the example. 
The pupil is advised to test each passage for the preceding errors and by 
the Method of Residues to eliminate all possibilities save the one finally 
chosen. 

1. With haggard eyes the Poet stood; 

Loose his beard and hoary hair 
Streamed like a meteor to the troubled air. 

2. By her own internal schism, by the abominable spectacle of a 

double Pope, the Church was rehearsing, as in still earlier forms 
she had already rehearsed, those vast rents in her foundation 
which no man should ever heal. 

3. As if the flower 

That blows a globe of after-arrowlets 
Ten-thousand fold had grown, flashed the fierce shield. 

All sun. 

4. Then Nature tries the earth if it be in tune 

And over it softly her warm ear lays. 

5. The Scripture moveth us to confess and acknowledge our man- 
ifold sins and wickedness. 

6. I will not suffer mine eyes to sleep. 

Nor mine eyelids to slumber. 

7. The last of men was Dr. Johnson to have abetted squandering 

the delicacy of integrity by multiplying the labors of talents. 

8. There was no light in heaven save a few stars; 

The boat put off overcrowded with their crews; 

She gave a heel, then lurched to port 

And going down, head foremost, — sunk, in short. 

9. She plows the billows like a hurricane, she throws the water 

from her bows, the wheels turn, the vessel starts. 

10. Thy hair is as a flock of goats 

That appeared from Mt. Gilead; 

Thy teeth are like a flock of sheep 
That are even shorn, 

Which come up from the washing. 

11. With nectar pure his oozy locks he laves. 

12. In the effort to eradicate the scourge of intemperance the 

slumbering fires of passion were kindled. 

13. The dawn is overcast; the morning lowers and heavily in clouds 

brings on the day. 

14. That time of year thou may’st in me behold 

When yellow leaves, or none, or few do hang 
Upon those boughs which shake against the cold, 

Bare ruined choirs where late the sweet birds sang. 

15. From the silence and deep peace of this saintly summer night, 

from the pathetic blending of this sweet moonlight, dawnlight, 
dreamlight, suddenly as from the woods and fields, suddenly 
as from the chambers of the air opening in revelation, suddenly 
as from the ground opening at her feet, leaped upon her Death, 
the crowned phantom. 

Test IV: Identification Test for Historical Classification 
of a Passage 

English literature between 1700 and the present day has passed through 
four Periods, which may be termed — 

A. Eighteenth century — classical period 

B. Transition period 

C. Nineteenth century — romantic period 

D. Modern and free verse 

A. Among the characteristics of eighteenth-century classical writing are 
critical analytic attitude, personifications of abstract ideas, heroic couplet, 
end-stopped lines, didactic morality, balance, antithesis, formal vocabulary, 
generalizations. 

B. Transition literature combined Eighteenth-century style with 
Nineteenth-century thought and feeling. 

C. Among the characteristics of Nineteenth-century romantic literature 
are sympathetic attitude especially to the oppressed; variety of figures, 
meters, rimes, rhythms; enthusiasm for nature; joy in the senses; effort 
toward beauty in thought and expression; personality of author revealed; 
mood of melancholy. 

D. Among the characteristics of modern and free verse are avoidance of 
fixed rhythm, meter, or riming system; experimentation; grouping of 
non-related images; definite unusual images; unrestricted choice of subject; 
language of common speech. 

The student should test the qualities of each selection, and identify it 
with one of the four historical periods. 

Place the letter for the period at the left of the first line of the passage. 

The student is advised, in case he recognizes the source of the quotation, 
to consider the selection here given only. 

1. Oh the wild joys of living! the leaping from rock to rock, 

The strong rending of boughs from the fir tree, the cool silver 
shock 

Of a plunge in a pool’s living waters. 




2. Ill fares the land to hastening ills a prey 
Where wealth accumulates and men decay. 

3. Ye who listen with credulity to the whispers of fancy and pursue 
with eagerness the phantoms of hope, who expect that age will 
perform the promises of youth, and that the deficiencies of the 
present day will be supplied by the morrow, attend. 

4. Sick for home 
She stood in tears amid the alien corn. 

5. It's a warm wind, the west wind, full of birds' cries, 

I never hear the west wind but tears are in my eyes; 

It's a fine land, the west land, for hearts as tired as mine; 
Apple orchards blossom there, and the air's like wine. 

6. Once more the ass did lengthen out 
The hard dry hee-haw of his horrible bray. 

7. O wind, rend open the heat, 
cut apart the heat, rend it sideways. 

Fruit cannot drop through the thick air: 
that presses up and blunts the points of pears 
and rounds the grapes. 

8. He gave to Misery all he had,— a tear; 

He gained from Heaven ('twas all he wished), a friend. 

EXAMINATION VI 

Examination VI is offered as a further example of objective 
measurement in a somewhat difficult field, viz., musical 
accomplishment.¹ 

The fact that Examination VI is a standardized test does 
not lessen its value for our purposes. In fact, the teacher 
wishing to improve her objective-test methods can learn 
much from a critical study of various standard tests. 

The teacher of music will note that this test rises above 
the level of the measurement of factual knowledge. It calls 
for a great deal of performance and application of technical 
information. Test 10, Recognition of Familiar Melodies from 
Notation, is particularly suggestive. 

¹The Kwalwasser-Ruch Test of Musical Accomplishment. Published and distributed by 
the Extension Division, State University of Iowa, Iowa City, Iowa. Reprinted by per- 
mission. Copyright, 1924, by Jacob Kwalwasser and G. M. Ruch. 
KWALWASSER-RUCH TEST OF MUSICAL ACCOMPLISHMENT 

For Grades IV-XII 

Do not open this paper, or turn it over, until you are told to do so. Fill these 
blanks, giving your name, age, birthday, etc. Write plainly. 


Name ........................ Date ........ 
(First name, initial, and last name) 

Age last birthday ........ years. Birthday ........ 
(Month and day) 

Grade ........ Teacher ........ 

School ........ City ........ 

How many years have you studied music in school? ........ 
How long have you studied music outside of school? ........ 
(State your answer in half-hour lessons.) 

Do not write below this line. 


Test   Name of Test                                       Score 

1      Knowledge of Musical Symbols and Terms             ...... 
2      Recognition of Syllable Names                      ...... 
3      Detection of Pitch Errors in a Familiar Melody     ...... 
4      Detection of Time Errors in a Familiar Melody      ...... 
5      Recognition of Pitch Names                         ...... 
6      Knowledge of Time Signatures                       ...... 
7      Knowledge of Key Signatures                        ...... 
8      Knowledge of Note Values                           ...... 
9      Knowledge of Rest Values                           ...... 
10     Recognition of Familiar Melodies from Notation     ...... 

       Total                                              ...... 




Do Not Turn Over The Page Until The Signal Is Given! 
TEST 1 

Knowledge of Musical Symbols and Terms 

Directions: Below are twenty-five questions about music. Five 
answers are given to each question. Read each question and then draw 
a line under the right answer. The sample is already marked as it should 
be. 

Sample: [symbol] is called a   sharp natural flat note rest 

Begin here: 

1 The first tone of the scale is   mi re do fa sol 
2 [symbol] is called a   rest natural sharp note flat 
3 The fifth tone of a scale is   do fa mi sol re 
4 [symbol] is a   flat note natural rest sharp 
5 [symbol] is a   sharp flat natural note rest 
6 [symbol] is a   slur hold rest double-sharp repeat-bar 
7 [symbol] is called a   sharp flat natural note rest 
8 p means   soft loud slow fast smooth 
9 [symbol] is called a   bar staff measure accent clef 
10 [symbol] is a   sharp flat natural note rest 
11 [symbol] is a   clef staff measure accent phrase 
12 [symbol] is called a   clef staff measure accent bar 
13 [symbol] is a   clef measure staff phrase accent 
14 [notation]: the curved line is a   slur tie hold accent rest 
15 [symbol] is a   rest slur hold double-sharp repeat 
16 [notation]: the curved line is a   slur hold rest tie accent 
17 [symbol] means   higher lower louder repeat pause 
18 [symbol] means   higher lower louder softer pause 
19 Allegro means   lively slow repeat accent sweetly 
20 f means   fast loud slow soft smooth 
21 cresc. means   softer louder slower faster smooth 
22 dim. means   smoother louder softer faster slower 
23 Lento means   repeat accent sweetly slow lively 
24 Legato means   soft quick separated connected loud 
25 Staccato means   quick soft separated connected loud 

Test 1. Number right = Score 


TEST 2 

Recognition of Syllable Names 

Directions: Below are five lines of notes. The first syllable in each 
line is [...], so the name [...] has been written below it. You are to 
write the syllable names on the lines under the other notes. 



Test 2. Number right = Score 
TEST 3 

Detection of Pitch Errors in a Familiar Melody 

Directions: The song “America” is written below. One measure 
has been crossed out because the melody is wrong. Five other measures 
are wrong. Hum over the melody to yourself and cross out all five wrong 
measures. 

Begin here: 



Test 3. Number right × 5 = Score 
TEST 4 

Recognition of Time Errors in a Familiar Melody 

Directions: The song “America” is written below. One of the meas- 
ures has been crossed out because it has the wrong number of beats. 
Five other measures are wrong. Hum over the song and cross out all five 
wrong measures. 

Begin here: 



Test 4. Number right × 5 = Score 


TEST 5 

Recognition of Pitch Names 

Directions: Below are four lines of notes. The first note in each 
line is already marked as it should be. You are to write the pitch or 
letter names on the lines under the other notes. 

Begin here: 

[Four lines of musical notation] 

Test 5. Number right = Score 
TEST 6 

Knowledge of Time Signatures 

Directions: Below are ten full measures. At the right of each are 
five time signatures. You are to draw a line under the correct time signa- 
ture for each measure. The sample is marked as it should be. 

[Sample and ten measures of musical notation, each followed by “The time signature is” and five time signatures] 

Test 6. Number right × 2 = Score 
TEST 7 

Knowledge of Key Signatures 

Directions: At the left below is a column of ten major key signatures. 
At the right is a column of five minor key signatures. You are to write 
the names of the keys on the lines at the right of each signature. 

Notice that there are two columns, one for major keys and one for minor. 



Test 7. Number right = Score 
TEST 8 

Knowledge of Note Values 

Directions: In the measures below a note has been left out of each. 
You are to draw a line under the note needed to complete the measure. 
The sample is already marked as it should be. 

[Sample and five measures of musical notation, each followed by “The note needed is” and a choice of notes] 

Test 8. Number right = Score 


TEST 9 

Knowledge of Rest Values 

Directions: The five measures below are incomplete and need a 
rest to complete them. You are to draw a line under the rest needed to 
complete the measure. The sample is already marked as it should be. 


[Sample and five measures of musical notation, each followed by “The rest needed is” and a choice of rests] 

Test 9. Number right × 3 = Score 
TEST 10 

Recognition of Familiar Melodies from Notation 

Directions: Below are phrases from ten songs that you know. Hum 
each line to yourself and then write the name of the song or the words of 
the phrase on the line at the right. 

The sample is already marked as it should be. 

Sample: 

[Phrase of musical notation]   America or My Country 'Tis of Thee 
Addenda. Mention should be made of the growing 
tendency for writers of public-school textbooks and pro- 
fessional works on education to publish tests paralleling the 
content of such books. An early example is the series of 
tests to accompany Beard and Bagley, The History of the 
American People, the Macmillan Company. Scott, Fores- 
man and Company publish a test paralleling Frasier and 
Armentrout, An Introduction to Education. The same pub- 
lishers offer an extensive series of tests for use in connection 
with the Pieper-Beauchamp, Everyday Problems in Science. 
The Plymouth Press publishes a test by D. L. Geyer over the 
content of the Twenty-first Yearbook of the National Society 
for the Study of Education, which deals with intelligence 
testing. A number of other authors and publishers are now 
planning tests to accompany their textbook offerings. 

Reference has already been made to the professional tests 
for selecting teachers in connection with the work of the 
Bureau of Public Personnel Administration. Many univer- 
sities, particularly Columbia University, University of Iowa, 
University of Minnesota, and Ohio Wesleyan University, 
have prepared extensive tests for use with certain classes. 
Samples of these tests may be had in most cases. Particular 
mention should be made of the Aptitude and Training tests 
of the Iowa Placement Examinations series. These cover 
such college subjects as English, mathematics, physics, 
chemistry, modern foreign languages, law, etc., and are 
published by the University of Iowa Extension Division. 
For examples of some of the older Columbia examinations 
see B. D. Wood, Measurement in Higher Education (1923) 
World Book Company. 

Weber has constructed a “Standard Achievement Test 
on Aims, Purposes, Objectives, Attributes, and Functions 
in Secondary Education”; publisher, Public School Publish- 
ing Company, Bloomington, Illinois. The tests prepared by 
the Summer Library Institute have already been listed. 
For further samples of objective tests the reader is referred 
to the section on “Sample Tests” in the General Bibliog- 
raphy at the end of this volume and to the previously cited 
volume of Ruch and Rice, Specimen Objective Examinations. 

In conclusion, the teacher is reminded that a careful study 
of the hundreds of standard tests in the elementary and 
high-school subjects will yield many invaluable suggestions 
as to approaches to the measurement of various school 
subjects. There are also a number of treatments of new- 
type or objective examinations listed in the section of the 
General Bibliography headed “Books, Monographs, and 
Bulletins.” All of these contain sample examinations 
and tests. 

For convenience the names of a number of leading pub- 
lishers of standard tests are given below. 

American Council on Education, 26 Jackson Place, Washington, D. C. 
Chicago, University of, Press, 5750 Ellis Avenue, Chicago, Ill. 
*Cincinnati, Bureau of Administrative Research of, University of Cincinnati, Cincinnati, Ohio 
Courtis, S. A., 9110 Dwight Avenue, Detroit, Michigan 
*Ginn and Company, 15 Ashburton Place, Boston, Mass. 
Gregg Publishing Company, 20 West 47th St., New York City 
Harvard University Press, Cambridge, Mass. 
Houghton Mifflin Company, 2 Park Street, Boston, Mass. 
*Iowa, University of, Extension Division, Iowa City, Iowa 
Lippincott Company, East Washington Square, Philadelphia, Pa. 
*Public School Publishing Company, Bloomington, Ill. 
Scott, Foresman and Company, 623 South Wabash Avenue, Chicago, Ill. 
Smith, Hammond and Company, Atlanta, Georgia 
Southwestern Publishing Company, Cincinnati, Ohio 
*Teachers College, Bureau of Publications of, Columbia University, New York City 
*World Book Company, Yonkers-on-Hudson, New York 
(Starred publishers are the larger distributors of standard tests.) 



CHAPTER X 


RULES FOR DRAFTING OBJECTIVE 
TEST ITEMS 

TRUE-FALSE TESTS 

Advantages and limitations. This chapter will deal with 
the principal forms of objective tests only. Their merits 
and demerits will be pointed out briefly, and certain rules 
for framing test items will be given. The true-false test 
will be considered first. 

(A) Merits 

1. Purely objective 

2. Easy and rapid scoring 

3. Can be made to measure reasoning as well as 
memory for facts 

4. Rapidity of answering allows extensive sampling 
in limited time 

5. Wide applicability 

(B) Limitations 

1. More difficult of construction than is commonly 
supposed (if ambiguities and partly-true-partly- 
false items are to be avoided) 

2. Open to guessing and chance effects to a marked 
degree 

3. Some subjects (e. g., the social and mental sciences) 
contain much that is controversial, and hence is 
neither absolutely true nor false. 

Rules for constructing true-false items. These rules, or 
any rules for that matter, can be of little more than general 
assistance in framing true-false items. They do serve to 
call attention to certain dangers and faults in test construc- 
tion. The rules follow.¹ 

1. A true-false item (and any objective-test item) should 
observe the rules governing good language expression, grammar, 
spelling, punctuation, and capitalization. 

Examples 

(Poor) Pasteurization is where milk is heated to about 165° F. for 
thirty minutes to retard souring. (Italicized statements here and in 
following examples show the faulty constructions, etc.) 

(Better) Pasteurization is the heating of milk to about 165° F. for 
thirty minutes in order to retard souring. 

(Poor) Sufficient data is at hand to suggest that speed and accuracy are 
closely related in arithmetic computation. 

(Better) Sufficient data are at hand to suggest that speed and accuracy 
are closely related in arithmetic computation. 

(Poor) Benedict Arnold was a noted traitor to his country. 

(Better) Benedict Arnold was a notorious traitor to his country. 

2. Avoid the use of double negatives, since these increase the 
reading difficulty of the item and thus tend to throw success or 
failure upon a basis of reading comprehension rather than 
knowledge of subject-matter. 

Examples 

(Poor) Freezing weather is not entirely unknown in Florida. (True) 

(Better) Florida occasionally has freezing weather. 

(Poor) Scientists believe that perpetual motion is not impossible. (False) 

(Better) Scientists believe that perpetual motion is possible. 

3. Avoid introducing “trick,” “catch,” or puzzle questions. 

Examples 

(Poor) The Civil War began in 1861 B. C. (False) 

(Better) The Civil War began in 1861 A. D. (Or omit the A. D. entirely. 
If a false question is desired, change 1861 to some other date.) 

(Poor) Runyan Kipling wrote “Just So Stories.” (False) (Note: 
Scored “false” because of error in first name of author.) 

(Better) Rudyard Kipling wrote “Just So Stories.” (Or, if a false item 
is desired, substitute the name of some other author.) 

¹For additional rules, see Weidemann, How to Construct the True-False Examination, pp. 
41-73. For similar rules for other types of objective tests, see Paterson, The Preparation 
and Use of New-Type Examinations, pp. 14-66. 
4. Avoid the use of items which are partly true and partly 
false. Some testers have made wide use of such statements, 
instructing the pupils to “mark the statement false if any 
part of it is false.” Such statements also tend to fall to the 
level of “catch” questions. A possible exception may be 
made with advanced college classes. 

Examples 

(Poor) The battle of Gettysburg, which is usually considered to be the 
turning-point of the Civil War, was fought in 1862. (False) 

(Better) The battle of Gettysburg was fought in 1862. (Note: Since 
the falsity of the item rests upon the date alone, omit the clause dealing 
with the turning-point of the war.) 

(Poor) Poe's writings are characterized by his bizarre imaginings, 
originality of theme, faithful character portrayals, and skill in arousing 
interest. (False) 

(Better) Poe's writings are characterized by their fidelity to historic 
events, faithful character studies, and conventional vocabularies. (False) 

(Poor) Many experts consider the income tax to be a very just form of 
taxation because such a tax can readily be passed on or shifted to another. 
(False) 

(Better) Many experts consider the income tax to be a very just form 
of taxation because such a tax cannot readily be passed on to another. (True) 

5. Avoid long sentences with many dependent or modifying 
clauses. There is no good reason why a true-false statement 
must be a single sentence. For the sake of easier comprehen- 
sion, it is often better to form two or three shorter sentences. 
It should be remembered that the true-false test should 
measure knowledge of subject-matter, not skill in reading. 

Examples 

(Poor) The digestion of starches, which is commenced in the mouth by 
the action of the ptyalin of the salivary secretions, is stopped when the 
food reaches the stomach, which has an acid reaction due to the presence 
of hydrochloric acid; ptyalin being active only in neutral or alkaline media. 
(True) 

(Better) The digestion of starches is begun in the mouth under the in- 
fluence of the ptyalin of the salivary secretions. Ptyalin is active only in 
neutral or slightly alkaline media. For this reason the acid of the gastric 
juice soon stops the digestion of starches after the food reaches the stomach. 
Even in the second phrasing of this item, care must be 
taken to avoid the situation of the partly-true-partly-false 
type of item. It would be better to break such an item into 
two or three separate items, somewhat as follows: 

1. The digestion of starches is begun in the mouth under the influence 
of the enzyme, ptyalin. (True) 

2. Ptyalin operates only in an acid medium. (False) 

3. The gastric juices of the stomach are alkaline in reaction. (False) 
Etc. 

6. Avoid words which prejudice pupils’ replies. Such 
words as “always” and “never” occur in false statements 
about twice as often as they occur in true statements, ac- 
cording to Weidemann. If this is true, pupils soon learn 
that such words offer “clues” to the right answers. Weide- 
mann calls these specific determiners. 

Examples 

(Poor) Animals always have the power of locomotion. (False) 

(Better) Some animals cannot move from place to place unless carried by 
outside forces. (True) 

A pupil ignorant of the fact that certain animals are 
sessile, will mark the first statement “false” simply because 
the word “always” appears in the statement. He might be 
quite ignorant of the fact that certain animal forms are 
incapable of independent movement. 

(Poor) Water never runs up hill. (False) 

A pupil may be quite uninformed of the facts about 
siphons, pumps, etc., where water rises for short distances, 
and yet answer “false” because of the word never. 
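
The teacher who wishes to search a draft test for specific determiners before it is given may find a small routine convenient. The following is only a minimal sketch: the watch list contains the two words named above, the function names are hypothetical, and a flagged item is merely a candidate for re-wording, not necessarily a faulty one.

```python
import re

# Assumed watch list: "always" and "never" are the two words named in the
# text. A teacher might extend it with other absolute terms.
SPECIFIC_DETERMINERS = ("always", "never")


def find_specific_determiners(statement, watch_list=SPECIFIC_DETERMINERS):
    """Return the watch-list words that occur as whole words in a statement."""
    words = re.findall(r"[a-z']+", statement.lower())
    return [w for w in watch_list if w in words]


def flag_items(statements):
    """Map each statement containing a specific determiner to the words found."""
    flagged = {}
    for statement in statements:
        hits = find_specific_determiners(statement)
        if hits:
            flagged[statement] = hits
    return flagged


items = [
    "Animals always have the power of locomotion.",
    "Water never runs up hill.",
    "Florida occasionally has freezing weather.",
]
for statement, hits in flag_items(items).items():
    print(hits, "->", statement)
```

Only the first two sample items are reported; the third contains neither word and passes unflagged.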

7. Choose simple, everyday words in preference to more 
technical or literary synonyms in framing true-false items. 

Examples 

(Poor) California and Florida produce large quantities of citrus fruits. 
(True) 

(Better) California and Florida produce large quantities of oranges, 
lemons, and grapefruit. 
(Poor) Good harbors are often formed by diastrophism. (True) 

(Better) Good harbors are often formed by the gradual sinking of coast 
lines near the mouths of rivers. 

8. Avoid having two items in the same test (or at least near 
together) the answer to one of which suggests the answer to 
the other. 

Item 1. It is thought that the Norsemen visited America earlier than 
the first voyage of Columbus. (True) 

Item 5. No white man saw the shores of America before the time of 
Columbus. (False) 

Items 1 and 5 are likely to prove mutually helpful in 
answering both items; one should probably be omitted. 

SIMPLE-RECALL TESTS 

Advantages and limitations. Simple-recall questions form 
one of the most widely used objective test types. If care- 
fully phrased, such tests form an almost ideal compromise 
between the completion type proper and the recognition 
types (true-false, multiple-choice, etc.); the former being 
not quite objective, and the latter being objective but open 
to guessing. 

The simple-recall test is almost perfectly objective, and 
it is little subject to chance effects. 

(A) Advantages 

1. Almost entirely objective 

2. Guessing and chance scores almost negligible 

3. Fairly easy and rapid scoring 

4. A “natural” form of questioning; similar to the 
usual oral or written question 

(B) Limitations 

1. Not quite perfectly objective 

2. Tend to be factual in character (as a matter of 
avoiding subjectivity in scoring) 

3. Scoring somewhat more laborious than in the case 
of recognition types 
Rules for constructing simple-recall items. A few general 
comments are in order: 

1. Avoid items which can be answered by the exercise of 
general intelligence without knowledge of the subject-matter 
concerned. 

Examples 

(Poor) The force of gravitation causes water to run ........ 

(Better) The force which causes water to run down hill is called 
gravitation. 

(Poor) In the autumn, deciduous trees shed their ........ 

(Better) Trees which shed their leaves in the autumn are termed 
deciduous. 

2. Since several answers for a given blank may be expected 
at times, make all meritorious answers center about a single 
idea. If all possible answers which are worthy of credit are 
really nothing but variations of verbal statements of a single 
idea, little subjectivity results. 

Examples 

(Poor) Fruit may be kept for long periods by ........ 

Other possible and meritorious answers are: “wrapping,” 
“drying,” “canning,” “freezing,” etc.; or even “dealers,” 
“merchants,” “people,” etc. Such an item is of little value. 

(Better) Fresh fruits are often kept from spoiling during long shipments 
by ........ 

“Refrigeration,” “ice,” “icing,” “cooling,” etc., are almost 
certain to arise as answers, but these are close synonyms. 

3. Make the response call for a single word, date, short 
phrase, or in general a very short response. 

(Poor) The Civil War started over ........ 

The expected reply was “secession.” However, there are 
many very different answers of considerable or equal merit 
which are certain to arise, e. g., “the firing on Fort Sumter,” 
“states’ rights,” “slavery,” etc. 
(Better) An immediate cause of the outbreak of the Civil War was the 
withdrawal from the Union of the State of ........ 

4. Avoid using the words “a” and “an” immediately before 
a blank whenever possible. These words offer slight clues to 
the expected answers. Possible solutions are (a) re-wording 
to avoid the indefinite article or (b) the use of such a device 
as “a(n).” Method (b) is somewhat unnatural but ade- 
quate. 

Examples 

(Poor) The unit of measurement of electric resistance is an ........ 

(Better) The unit of measurement of electric resistance is the ........ 

(Better) The unit of measurement of electric resistance is a(n) ........ 


In the above case the only very likely answers are ohm, 
volt, ampere, or watt. Either “a” or “an” in such items 
reduces the probable field of choice to two possibilities, 
instead of four. 
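
The narrowing effect of the article can be verified mechanically. The sketch below is illustrative only: the function names are hypothetical, and the rule used to choose between “a” and “an” looks at the first letter alone, which serves for the four words in question but not for English generally (e. g., “hour”).

```python
def article_for(word):
    """Return "an" before a vowel letter and "a" otherwise (a rough spelling rule)."""
    return "an" if word[0].lower() in "aeiou" else "a"


def answers_still_possible(candidates, article):
    """Keep only the candidate answers that agree with the printed article."""
    return [c for c in candidates if article_for(c) == article]


likely = ["ohm", "volt", "ampere", "watt"]
print(answers_still_possible(likely, "an"))  # ['ohm', 'ampere']
print(answers_still_possible(likely, "a"))   # ['volt', 'watt']
```

Either article leaves the pupil with two of the four likely answers, so that a blind guess succeeds half the time instead of one time in four.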

5. Terminal and aligned blanks are more convenient in 
scoring than are staggered blanks. 

Examples 


(Poor) ........ wrote “The Raven.” 

There are ........ lines in a sonnet. 

The men of Odysseus were changed into ........ by Circe. 

(Better) “The Raven” was written by ........ 

The number of lines in a sonnet is ........ 

Circe changed the men of Odysseus into ........ 

COMPLETION TESTS 

Advantages and limitations. It should be remembered 
that the completion test originated as a test of general 
intelligence at the hands of Ebbinghaus. Its use as a sub- 
ject-matter test requires that it be greatly modified from 
the form in which Ebbinghaus used it. 
(A) Advantages 

1. Very free from guessing and chance effects 

2. May be used in almost any school subject 

3. Allows some freedom of expression and reasoning 

4. Is a “natural” form of questioning as it parallels 
the thought processes 

5. Is easy to prepare 

(B) Limitations 

1. Not highly objective unless great care is taken 

2. Unless the number of blanks in a given passage is 
kept small, the completion test tends to call for the 
exercise of general intelligence to too large an ex- 
tent 

3. Difficult to score because of the staggered arrange- 
ment unless cut-out stencils are used 

Rules for constructing completion tests. These are 
naturally similar to those for simple-recall items in many 
respects. 

1. Make each blank call for a single idea. (See Rule 2 for 
simple-recall tests.) It does not introduce marked subjectiv- 
ity if there are numerous equivalent answers, provided these 
are listed on the scoring key or stencil. 

2. Avoid a large number of blanks in a single sentence or 
paragraph. Omit only a few critical words. 

Examples 

(Poor) The walls of the stomach secrete the gastric juice which acts 
on ........ changing them into ........ and peptones. The active enzyme 
is pepsin but the gastric juice also contains an ........ which activates the 
pepsin. 

(Better) The gastric juice contains the enzyme pepsin, which acts on 
........ It also contains ........ which makes the pepsin active. 

(Poor) The principal ........ of the ........ War was ........ 

(Better) The battle of ........ is usually considered to be the turning- 
point of the Civil War. 
The first statement of the history item did not make clear 
whether the first blanks referred to battles, causes, generals, 
results, or what, nor is it clear what war is meant. The 
second form is better, although less broad in scope. 

3. Make all blanks of the same length in order to avoid giving 
clues to the length of the answer expected. 

Examples 

(Poor) The process of food manufacture in green plants is called 
.............. The products of this process are ..... and carbon-dioxide. 

(Better) The process of food manufacture in green plants is ........ 
The raw products of this process are ........ and ........ 

4. Do not attempt (as is sometimes done) to make each dot of 
a blank stand for a letter of the required word. Some test 
workers use this plan rather largely. It is open to the ob- 
jection that a pupil may know the answer but will not hit 
upon the particular synonym that the test maker had in 
mind when he drafted the item. 

Examples 

(Poor) A precious metal heavier than lead is ........ 

The expected answer was “platinum” but “gold” or “iridium” are 
equally good answers. 

(Better) A precious metal heavier than lead is ____________ 

(Poor) Plants with scattered fibro-vascular bundles are called 
............. 

The thirteen dots were intended to suggest “monocotyle- 
dons,” although “endogens” is equally good as an answer. 

5. It is ordinarily unwise and uneconomical to make com- 
pletion tests by deleting occasional words from an actual passage 
in the textbook. Better completion exercises may be obtained 
by actual writing of suitable sentences or paragraphs de 
novo. 

For obvious reasons, it is difficult to give examples here. 
The teacher can test the soundness of this advice by actual 
attempts to make completion exercises from quotations from 
a textbook. It can be done at times, although the procedure 
is wasteful. 

MULTIPLE-RESPONSE TESTS 

Advantages and limitations. Although the true-false test 
is, in one sense, a multiple-response test, we shall treat the 
latter here, as elsewhere, as a separate variety of recognition 
types. It is often convenient to view the simple-completion 
(recall) and the completion test proper as recall tests in the 
generic sense that they suggest little or nothing about the 
expected answer. Multiple-choice, true-false, matching, 
rearrangement, etc., types fall under the generic heading of 
recognition tests. Even the two-response test, which has 
some features in common with true-false tests (e.g., 50:50 
chance of guessing correctly ; at least in theory) , will be shown 
in a later section to differ importantly from the true-false 
type of item. The rules governing the construction of 
multiple-choice and true-false tests are dissimilar enough to 
justify separate treatments. 

(A) Merits 

1. Fairly easy to construct 

2. Purely objective 

3. Usually more reliable than true-false tests, but ordinari- 
ly not quite so reliable as well-constructed simple-recall 
tests (when equal numbers of items are considered) 

4. May be made to test reasoning as well as facts if the 
response items are made long statements. The reference 
here is to the best-answer type of multiple-choice items. The 
best-answer test lends itself to the use of several more or 
less extended statements which may be made to vary in 
merit from one entirely acceptable statement, through 
gradual degrees to one or more statements which are quite 
lacking in merit. 
5. A sufficient number of statements can be used to 
minimize guessing to any desired degree (within practical 
limits). 
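
The effect of the number of responses upon guessing may be put in figures. The sketch below assumes a pupil who guesses blindly, responses that are all equally plausible to him, and items answered independently of one another; none of these conditions holds exactly in practice, and the function names are hypothetical.

```python
from math import comb


def chance_of_item(n_responses):
    """Probability of guessing one item correctly among n equally likely responses."""
    return 1 / n_responses


def prob_at_least(k, n_items, n_responses):
    """Binomial probability of at least k correct guesses on n_items items."""
    p = chance_of_item(n_responses)
    return sum(
        comb(n_items, r) * p**r * (1 - p) ** (n_items - r)
        for r in range(k, n_items + 1)
    )


# Expected score from pure guessing on a twenty-item test.
for n in (2, 3, 4, 5, 7):
    print(n, "responses:", 20 * chance_of_item(n), "items expected by chance")
```

Under these assumptions the expected chance score on twenty items falls from ten items with two responses to four items with five responses, and `prob_at_least` gives the likelihood that guessing alone reaches any stated score.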

(B) Limitations 

1. Test makers must guard against allowing these tests 
to become purely fact items. 

2. They are likely to be space-consuming, especially if 
the best-answer variety is employed. 

3. It is often difficult to find from three to seven responses 
which present reasonable plausibility. In a four-response 
test, for example, if two responses are obviously unrelated 
and absurd, the choice often reduces to a situation of but 
two responses. 

Rules for constructing multiple-response tests. The 
following suggestions may prove helpful. 

1. Use at least four or five responses whenever possible in 
order to minimize chance successes. 

2. Choose the responses so that all, or at least most, of them 
have some degree of plausibility. 

Examples 

(Poor) The first president of the United States was John Adams 
Henry Ford Jack Dempsey George Washington Douglas Fairbanks 

(Better) The first president of the United States was John Adams 
Thomas Jefferson George Washington James Madison Andrew 
Jackson 

3. Avoid wordings which serve as clues, e.g., changes in 
parts of speech, mixed singular and plural responses, etc. 

Examples 

(Poor) “Candid” means incisive candy frank slavery cowardly 

(Better) “Candid” means incisive secretive frank abrupt crafty 

(Poor) The name for the result in division is sum factors quotient 
remainders products 

(Better) The name for the result in division is sum factor quotient 
remainder product 
4. Avoid, when possible, the use of “a” or “an” as the final 
word prior to the listing of the responses; these words act as 
hints or clues. 

(Poor) The starfish is an insect sponge echinoderm protozoan 
coelenterate 

(Better) The starfish belongs to the group of insects sponges echin- 
oderms protozoans coelenterates 

(Better) The starfish is a(n) insect sponge echinoderm protozoan 
coelenterate 

5. Make the first, second, third, etc., responses the correct 
response in about equal numbers. 

6. Do not mix items with varying numbers of responses in 
the same test if the scores are to be corrected for chance [by the 
formula, Score = R - W/(n - 1)]. 

7. Do not allow the correct response to occur in the same 
position (order) for more than two or three successive items. 
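
The correction for chance mentioned in Rule 6 may be illustrated as follows. The sketch takes the formula in its customary form, Score = R - W/(n - 1), in which R is the number of items right, W the number wrong, and n the number of responses offered per item; omitted items are assumed to count as neither right nor wrong, and the function name is hypothetical.

```python
def corrected_score(right, wrong, n_responses):
    """Score = R - W/(n - 1); omitted items count as neither right nor wrong."""
    if n_responses < 2:
        raise ValueError("an item needs at least two responses")
    return right - wrong / (n_responses - 1)


print(corrected_score(30, 10, 2))  # two-response case, R - W: 20.0
print(corrected_score(30, 10, 5))  # five-response case: 27.5
```

Because the penalty for a wrong answer depends upon n, a single value of n must apply to every item scored together, which is one way of seeing why items with varying numbers of responses should not be mixed when this correction is used.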

MATCHING TESTS 

Advantages and limitations. Matching tests have certain 
features not common to the other types of objective tests. 
(A) Merits 

1. Purely objective 

2. Easily constructed for certain types of subject matter 

3. Rapidly scorable 

4. May be used to measure either factual mastery or 
judgment 

5. Chance successes may be avoided by using ten or more 
pairs, or incomplete matchings. 

{B) Limitations 

1. Much subject-matter does not lend itself to this method. 

2. If five or fewer pairs are used, chance enters appreciably 
in success or failure. 

3. Very long exercises (25-30 or more pairs) are wasteful of 
pupils’ time in searching out the proper pairings (without 
bringing adequate compensations by way of reducing 
guessing). 

4. When matching exercises deal with dates and chronol- 
ogies, the use of large numbers of pairs tends to throw dates 
so close together that such fine discriminations cannot be 
justified upon the grounds of social utility. 

Rules for constructing matching tests. The following 
rules are based principally upon experimental findings of 
the author and his students. 

1. The optimum number of pairs to be matched is probably 
between ten pairs and twenty pairs; fewer than ten introduces 
a considerable chance element, and more than twenty is 
decidedly wasteful of time. 

2. If fewer than ten complete pairs are to be matched, make 
an excess of statements in one column or the other. 

Examples 

1453 (1534) Cartier discovered the St. Lawrence River 

1492 ( ) Sea trade with India established by da Gama 

1498 ( ) Defeat of the Spanish Armada 

1534 ( ) Capture of Constantinople by the Turks 

1588 ( ) First voyage of Columbus 

1608 

1619 

1754 

3. Avoid such clues as having some words plural and some 
singular. 


Example 

1. Enzymes               ........ A digestive gland 
2. Radius and ulna       ........ A large artery 
3. Pancreas              ........ Bones of the arm 
4. Aorta                 ........ The wind-pipe 
5. Trachea               ........ Digestive agents 

4. Avoid having a small number of dates, or other distinctive 
facts, in a general list, since these obviously reduce the field of 
choice to selection among a very few alternatives. 


Example 


1. Battle of Bunker Hill 

2. Louisiana Purchase 

3. Spoils System 

4. Food relief in Belgium 

5. Famous panic 

6. Forest conservation 

7. Eighteenth Amendment 

8. Lee's surrender 

9. Panama Canal 

10. Thirteenth Amendment 


3. Andrew Jackson 

Herbert Hoover 

Goethals 

1775 

Thomas Jefferson 

1865 

Pinchot 

Prohibition 

Slavery 

1893 


In the above example the pairing of the italicized items 
and responses (or the bold-face items and responses) is far 
more likely, by guess, than seems to be true at first sight. 
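
The chance element mentioned in rules 1, 2, and 4 can be made concrete by counting. The sketch below is an illustration only: it assumes pure blind guessing in which every response is used at most once, and the function name is hypothetical. It enumerates every possible way of pairing the statements with the responses and averages the number of correct pairings.

```python
from fractions import Fraction
from itertools import permutations


def expected_chance_matches(n_items, n_options):
    """Average number of correct pairings under blind guessing.

    Each of the n_items statements has exactly one correct option
    (option i belongs to item i); any extra options are distractors.
    Every option may be used at most once, as in a matching exercise.
    """
    if n_options < n_items:
        raise ValueError("need at least as many options as items")
    total_correct = 0
    n_arrangements = 0
    for guess in permutations(range(n_options), n_items):
        total_correct += sum(1 for item, option in enumerate(guess) if item == option)
        n_arrangements += 1
    return Fraction(total_correct, n_arrangements)


for items, options in [(5, 5), (5, 8), (7, 7)]:
    hits = expected_chance_matches(items, options)
    print(f"{items} items, {options} options: {hits} expected chance hits, "
          f"{float(hits / items):.1%} of the exercise")
```

With equal columns the blind guesser averages one correct pairing whatever the length, so the share of the exercise won by chance shrinks as pairs are added; an excess of responses, as in the dates example under rule 2, lowers it further. A handful of dates in a general list works the other way: it cuts the effective number of options for those items to the few dates present.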



Part III 

EXPERIMENTAL AND THEORETICAL 
CONSIDERATIONS 




CHAPTER 33 


EXPERIMENTAL STUDIES ON NEW-TYPE 
EXAMINATIONS¹ 

Introduction. This chapter will summarize the principal 
comparative, experimental studies of the merits of the 
various types of objective test items under four general 
headings: 

I. Studies of Comparative Validities 

II. Studies of Comparative Reliabilities 

III. Studies of Comparative Working Times 

IV. Studies of Comparative Difficulties 

COMPARATIVE VALIDITIES 

Brinkley’s study. Brinkley² compared a number of types 
of tests by correlating each against a general criterion made 
up of a large number of tests (both old- and new-type), 
class marks, teachers' judgments, and pupils' judgments. 
These were combined into a single measure, and each objective 
test was correlated against this criterion. He also made 
two sub-criteria, one for the thought values and the other for 
the information values of the tests and examinations. Selected 
findings by Brinkley are summarized in Tables 25 and 26. 

Brinkley's results show the objective types of tests to be 
somewhat superior to the essay examinations. The completion 
and word-or-phrase-answer types seem to be somewhat 
superior to the true-false and multiple-choice varieties, 
but these differences are too small to be statistically 
significant. 

¹Part III (especially Chapters XI, XII, and XIII) is intended primarily for serious 
students of educational measurement. It is to be hoped, however, that the classroom 
teacher will be interested in the more critical phases of examination construction discussed 
in Part III. If the teacher’s grasp of the theory and practice of educational measurement 
is to rise above rule-of-thumb methods, certain experimental and statistical results must 
be examined critically. 

²S. G. Brinkley, “Values of the New-Type Examinations in the High School,” Teachers 
College Contributions to Education, No. 161 (New York: Columbia University, 1924), 
especially pages 68-82 and 84-90. 

TABLE 25 

Brinkley’s Correlations of 31-Minute Tests With His General Criterion* 

Type of Test                     Correlation with General Criterion 
True-false                       .82 ± .023 
Multiple-choice                  .82 ± .023 
Completion                       .84 ± .021 
Word or phrase answer            .86 ± .018 
Rearrangement                    .82 ± .023 
Essay Test I                     .81 ± .024 
Essay Test II                    .76 ± .029 
Fall term marks                  .65 ± .040 
Teachers’ judgments              .88 ± .016 
Pupils’ judgments                .83 ± .022 
Otis Intelligence Test scores    .55 ± .048 

*Quoted from Brinkley’s Table V, op. cit., p. 85. 

Table 26 below reports the correlations of the seven dif- 
ferent types of tests with Brinkley’s thought and inform- 
ation criteria. 


TABLE 26 

Brinkley’s Correlations of 31-Minute Tests With His Information 
and Thought Criteria* 

Type of Test              Correlation With         Correlation With 
                          Information Criterion    Thought Criterion 
True-false                .75†                     .70† 
Multiple-choice           .76                      .64 
Completion                .78                      .74 
Word or phrase answer     .76                      .75 
Rearrangement             .70                      .64 
Essay (2 tests)           .73                      .73 
Intelligence              .66                      .70 

*Quoted from Brinkley’s Table VII, op. cit., p. 89. 
†The probable errors range from .031 to .041. 

Although the correlations of the new-type tests with the 
thought criterion are somewhat lower than with the 
information criterion, the general sweep of the evidence supports 
the conclusion that the new-type tests, especially the 
completion and word-or-phrase-answer tests, are at least as 
valid as the essay examinations. 

The experiments of DeGraff and Ruch. DeGraff and 
Ruch, working under a subvention from the New York 
Commonwealth Fund, studied a number of types of objective 
tests. As a criterion, two simple-recall tests were constructed 
so as to be equivalent forms. The subject-matter was that 
of United States history. A total of 2533 pupils took both 
of these recall tests. Each pupil then took the “same” 
items once more on a different day in some recognition form 
(true-false, two-response, three-response, five-response, or 
seven-response). The groups taking the recognition forms 
were further subdivided by the plan of having half take the 
test with instructions to guess (when in doubt) and half 
with directions not to guess. 

To summarize, all pupils first took Recall, Form A (100 
items) and Recall, Form B (100 items); the total group was 
then divided by chance into ten sub-groups, each sub-group 
taking one of the five above-mentioned recognition editions 
of the “same” items with instructions either for or against 
guessing. 

The word “same” has been placed in quotation marks 
because it is doubtful whether the items may be held to be 
the same when changed from recall to the various recogni- 
tion types. The degree to which the items remained the 
same throughout the various editions of the tests may be 
judged by the following sample item. 

Recall: Eli Whitney is noted for his invention of the ........ 

7-response: Eli Whitney is noted for his invention of the (1) steamboat 
(2) spinning jenny (3) cotton gin (4) telegraph (5) telephone 
(6) printing press (7) steam engine. 

5-response: Eli Whitney is noted for his invention of the (1) steamboat 
(2) spinning jenny (3) cotton gin (4) telegraph (5) telephone. 

3-response: Eli Whitney is noted for his invention of the (1) spinning 
jenny (2) cotton gin (3) telegraph. 

2-response: Eli Whitney is noted for his invention of the (1) spinning 
jenny (2) cotton gin. 

True-false: Eli Whitney is noted for his invention of the spinning jenny. 


Table 27 shows the validity coefficients (correlations of 
each test against the recall test as a criterion).¹ 

TABLE 27 

Intercorrelations, Corrected and Uncorrected for 
Chance, for All Ten Tests Used 

                     Recall A vs. Recognition A    Recall B vs. Recognition B 
Type of Test         Uncorrected    Corrected      Uncorrected    Corrected 
7-response (g)*      .871 ± .011    .873 ± .011    .816 ± .015    .861 ± .011 
7-response (n)†      .927 ± .006    .926 ± .006    .872 ± .012    .898 ± .009 
5-response (g)       ....           .910 ± .008    ....           .... 
5-response (n)       ....           .918 ± .007    .836 ± .013    .870 ± .010 
3-response (g)       .838 ± .013    .848 ± .012    .797 ± .016    .875 ± .010 
3-response (n)       .845 ± .014    .915 ± .007    .852 ± .012    .902 ± .008 
2-response (g)       .859 ± .012    .865 ± .011    .735 ± .021    .806 ± .016 
2-response (n)       .740 ± .018    .775 ± .016    .752 ± .018    .868 ± .010 
True-false (g)       .804 ± .015    .839 ± .013    .675 ± .024    .801 ± .016 
True-false (n)       .749 ± .018    ....           .768 ± .017    .856 ± .011 

Correlation of Recall A vs. Recall B                   .950 ± .001 
Coefficient of reliability (sum of Recall A and B)     .974 ± .001 

*(g) indicates the tests taken under instructions to guess. 
†(n) indicates the tests taken under instructions not to guess. 

Table 27 shows clearly that the recognition types of tests 
(if the recall tests may be accepted as a valid criterion) 
measure roughly the same abilities or functions. The 
correlations are moderately high in all cases, although it 
appears to be true that the larger the number of responses 
per item, the more valid the test. The correlations 
presented are never close to unity (1.00, or perfect correlation) 
for a number of reasons, particularly, (a) the fact that both 
the recall tests and the recognition tests are unreliable to 
some degree and hence cannot correlate perfectly, and (b) 
the fact that recall and recognition tests undoubtedly 
measure somewhat different abilities. The rough indication 
is that the various types of tests studied are not greatly unequal 
in validity, although true-false and two-response tests are less 
valid than the tests with a larger number of optional responses. 

¹Abridged from G. M. Ruch et al., Objective Examination Methods in the Social Studies 
(Chicago: Scott, Foresman and Co., 1926), p. 75. (The details of this study are given in 
this reference.) A briefer report is given in the Journal of Educational Psychology, Vol. XVII 
(September, 1926), pp. 368-375. 

It is further true that instructions against guessing 
combined with the use of the chance correction formula 
$S = R - \frac{W}{n-1}$ 
give somewhat more valid results than when directions to guess 
are employed (whether corrected or uncorrected). This 
point will receive further attention in a later chapter. 

It would be interesting to know whether the lack of perfect 
correlation between the criterion (recall tests) and various 
types of recognition tests is due principally to (a) unreliability 
or (b) differences in the abilities measured. The only 
way to attack this question is through recourse to what the 
statistician calls “correction for attenuation.” Coefficients 
of correlation may be corrected for the dilution arising from 
unreliability by the use of certain formulas derived by 
Professor Spearman and others. The resulting corrected 
coefficients of correlation are to be thought of as estimates 
of the probable correlation which would be found if perfectly 
reliable measures had been used.¹ 

¹The form of the formula for corrections for attenuation used in obtaining the values in 
Table 28 is: 

$$ r_{x_\infty y_\infty} = \frac{\sqrt{r_{x_1 y_1}\, r_{x_2 y_2}}}{\sqrt{r_{x_1 x_2}\, r_{y_1 y_2}}} $$ 

Where: 

$r_{x_\infty y_\infty}$ is the corrected coefficient of correlation 
$r_{x_1 y_1}$ is the correlation of Recall A with Recognition A 
$r_{x_2 y_2}$ is the correlation of Recall B with Recognition B 
$r_{x_1 x_2}$ is the correlation of Recall A with Recall B 
$r_{y_1 y_2}$ is the correlation of Recognition A with Recognition B 
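
The correction for attenuation can be sketched in a few lines. The version below assumes the Spearman form in which the geometric mean of the two recall-recognition correlations is divided by the geometric mean of the two reliability coefficients. The first three input values are taken from Table 27 (7-response, instructions to guess); the Recognition A vs. Recognition B correlation of .80 is an assumed value for illustration, since that coefficient is not reported in this chapter.

```python
from math import sqrt


def correct_for_attenuation(r_x1y1, r_x2y2, r_x1x2, r_y1y2):
    """Estimate the correlation between perfectly reliable measures.

    r_x1y1, r_x2y2: the two observed recall-recognition correlations
    r_x1x2, r_y1y2: the reliabilities (form A vs. form B) of each measure
    """
    if min(r_x1y1, r_x2y2, r_x1x2, r_y1y2) <= 0:
        raise ValueError("all coefficients must be positive")
    return sqrt(r_x1y1 * r_x2y2) / sqrt(r_x1x2 * r_y1y2)


corrected = correct_for_attenuation(
    r_x1y1=0.871,  # Recall A vs. Recognition A (Table 27)
    r_x2y2=0.816,  # Recall B vs. Recognition B (Table 27)
    r_x1x2=0.950,  # Recall A vs. Recall B (Table 27)
    r_y1y2=0.80,   # Recognition A vs. Recognition B: ASSUMED, not from the source
)
print(f"corrected coefficient: {corrected:.3f}")
```

The lower the reliabilities in the denominator, the larger the upward correction; with perfectly reliable measures (both reliabilities 1.00) the formula returns the observed correlation unchanged.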

Table 28 shows the coefficients corrected for attenuation. With 
two exceptions the values based on scores uncorrected for chance 
are above 0.900. When scores are corrected for chance, the 
values are always over 0.900. These results indicate rather 
definitely that recall and recognition types measure roughly the 
same abilities. 

TABLE 28* 

Correlation of Recall and Recognition When 
Corrected for Attenuation 

                 Uncorrected for    Corrected for 
                 Chance             Chance 
7-response       .967               .971 
5-response       .974               .975 
3-response       .916               .954 
2-response       .945               .921 
True-false       .943               .953 

                 Uncorrected for    Corrected for 
                 Chance             Chance 
7-response       .980               .982 
5-response       .953               .976 
3-response       .925               .988 
2-response       .838               .917 
True-false       .827               .962 

*Objective Examination Methods in the Social Studies, p. 77. 

The reader unfamiliar with statistical methods may regard 
the corrected coefficients of correlation as promises of what 
would be obtained if infinitely long (and hence perfectly 
reliable) tests of both recall and recognition types had been 
correlated. It cannot well be maintained that true-false 
and multiple-choice tests do not measure roughly the same 
abilities as recall tests in the light of this evidence. Some 
writers and teachers have held that true-false tests, 
especially, were largely matters of chance and hence not 
trustworthy. Although true-false and two- and three-response 
tests are open to much guessing and chance successes or 
failures, nevertheless these tests, if made long enough to 
compensate for guessing, do measure roughly the same functions as 
do the recall tests whose validity is never seriously questioned. 

There is one disturbing factor in Table 28 which needs 
comment, viz., the fact that this table compares various 
recognition tests of 100 items each, but overlooks the fact 
that the working times are very different for such tests as 
the true-false and seven-response. The question may well 
be asked whether comparisons upon the basis of equal 
working times rather than equal numbers of items would not be 
fairer. The answer is undoubtedly in the affirmative, and 
such comparisons will be made when the question of 
relative reliabilities is discussed. 

Wood’s investigations. Wood has done invaluable work 
in comparing the relative validities of old- and new-type 
examinations. In one report¹ he gives extensive data on 
true-false tests in a number of college subjects. As in the 
previously reported study of DeGraff and Ruch, validity 
coefficients are reported for both corrected and uncorrected 
(for chance) scores. Wood’s criteria differ somewhat from 
one subject to the next, usually being a combination of 
other tests, instructors’ marks, and old-type examinations. 
The data of Table 29 are selected from his report. 

TABLE 29 

Selected Validities Reported by Wood for College Courses 

Subject                         No. of    Score =      Score = Rights 
                                Items     No. Right    Minus Wrongs 
French                          100       .706         .747 
Law (Pleading and Practice)     180       .845         .868 
Law (Property)                  200       .669         .709 
Law (Torts)                     130       .761         .815 
Anatomy*                        130       .654         .632 
Anatomy†                        130       .649         .640 
Anatomy**                       130       .766         .766 
Averages                                  .721         .769 

*Criterion here is average of three one-hour essay examinations. 
†Criterion here is average of all first-year medical grades except anatomy. 
**Criterion here is a 200-item completion test. 

¹B. D. Wood, “Studies of Achievement Tests,” Journal of Educational Psychology, 
Vol. XVII (1926), pp. 1-22, 125-129, and 263-269. 

Using as a criterion six essay examinations, Wood has 
calculated the validity coefficients given in Table 30 for the 
three law examinations. 


TABLE 30 

Validities of Three Law Examinations as Given by Wood 

Subject                   No. of    Score =      Score = Rights 
                          Items     No. Right    Minus Wrongs 
Pleading and Practice     180       .688         .744 
Property                  200       .705         .745 
Torts                     130       .605         .674 
Averages                            .666         .721 



Although only true-false tests are involved in this study 
by Wood, the results give confidence in the conclusion that 
true-false tests are measures of mastery of subject-matter. 
It should be pointed out that these law examinations are 
by no means merely tests of memory for facts, but that they 
also include a great deal of reasoning. The correlations reported 
are far from perfect, but this is not surprising as the criteria 
are also far from ideal measures. 

The study by Paterson and Langlie. These authors¹ 
report findings on a 100-item true-false test in general 
psychology. The criterion of validity here is average 
scholarship. The validity coefficients are reported for both 
“rights” and “right-minus-wrong” scores. 


Method of Scoring        Validity       No. of Items    No. of Cases 
No. Right                .44 ± .050     100             111 
Rights minus Wrongs      .39 ± .054     100             111 


These validity coefficients are very low in comparison 
with the studies reported earlier. Two reasons are suggested 
for such low values: (a) average scholarship is a very fallible 
criterion, and (b) the true-false test used was rather too easy 
for good discrimination (the average score being about 84 
for “rights” and 71 for “rights-minus-wrongs”). It is 
also possible that this examination was not very adequate; 
indeed, Paterson and Langlie report its reliability as 0.63 
for “rights” and but 0.54 for “rights-minus-wrongs.” 
Such values are certainly low in comparison with those 
reported by other investigators for 100-item true-false tests. 
(See Charles’s results, which follow.) 

Charles’s study of five types of objective items in psychology 
examinations. Charles has carried on under the author’s 
direction a much more extensive study than that reported 
by Paterson and Langlie, and with rather different results.¹ 
The general plan of the investigation was that devised by 
DeGraff and Ruch. Table 31 gives a summary of Charles’s 
findings. 

TABLE 31 

Charles’s Investigation of the Validities of Five Types of Objective 
Tests in Elementary Psychology 

                              Validity Coefficients 
                          Criterion =         Criterion = 
Type of Test    No. of    Recall Scores       Term Marks 
                Items     Rights    R-W       Rights    R-W 
5-Response      50 
3-Response      50 
2-Response      50 
True-false      50 
Recall          50 

.23 ± .047 
.26 ± .046 
.41 ± .041 
.26 ± .047 



¹J. W. Charles, A Comparison of Five Types of Objective Tests in Elementary Psychology 
(1926), unpublished doctor's dissertation, University of Iowa. A summary appeared in 
the Journal of Applied Psychology, Vol. XII (1928), pp. 398-403. 

Charles’s validity coefficients computed against the 
criterion of term marks are even lower than those reported 
by Paterson and Langlie for average scholarship. Charles’s 
tests, however, were only half as long as those used by 
Paterson and Langlie. When the recall test scores are taken as a 
criterion, the validities are much higher; in fact, Charles 
showed that the correlations between recall and recognition 
types were almost as high as could be expected in view of 
the unreliabilities of the measures.¹ If allowance for 
unreliabilities is made, the recall and recognition tests may be 
said to measure substantially the same abilities. 

Summary of the studies on comparative validities. The 
following conclusions seem justified in view of the findings 
of Brinkley, DeGraff and Ruch, Wood, Paterson and 
Langlie, and Charles: 

1. Where old- and new-type tests are compared, the new- 
type are at least as valid as the traditional examination. 

2. There is no reason to believe that the newer objective 
tests are impotent for the measurement of reasoning and 
thought in contrast with memory for facts. 

3. If recall tests are held to be valid (and there is no 
evidence to the contrary), recognition tests measure roughly 
the same abilities or functions. 

4. When validities are measured against school marks as 
a criterion, the correlations are lower than where long 
objective tests are used as the criterion of validity. Such a 
finding, however, is in line with the expectancy, since school 
marks are very unreliable and hence will not support high 
correlations. 

5. Instructions against guessing seem to give more valid 
results than where pupils are directed to guess. 

6. When validity coefficients are corrected for attenuation 
(errors due to unreliability of measurement), the resulting 
values are high, showing that true-false, multiple-choice, 
and recall tests measure roughly the same abilities. 

¹Two sets of measures cannot correlate higher than the square root of the product of 
their reliability coefficients, except as a matter of chance. The values found by Charles 
are very close to such limits. 

COMPARATIVE RELIABILITIES 

Toops’s investigation. Toops seems to have made the 
first important contribution on the subject of the relative 
reliabilities of various types of objective test items.¹ He 
made first a fifty-item recall test over general information, 
of which the following are samples: 

1. What letter designates the note on the bottom line of the staff in 
music? Ans. E. 

50. In what city is the Smithsonian Institute? Ans. Washington. 

The items were then changed into five-response recog- 
nition types, and then to true-false; e. g., Item 1 became: 

(Recognition) What letter designates the note on the bottom line of the 
staff in music? Ans. (A, E, G, B, C). 

(True-false) The letter G is the note which is on the bottom line of the 
staff in music. True False 

Table 32 reproduces Toops's results (Table II, p. 49 of 
Toops’s monograph). 


TABLE 32 

Comparison of Reliability Coefficients of the Recall, 
Recognition, and True-False Tests 

                                              Recall    Recognition    True-false 
Reliability (r11) of halves, 124 
cases. Two forms of 25 each                   .448      .385           .340 
Reliability of two 50-question 
sets (Brown’s formula, n = 2)                 .618      .556           .507 
Average time in minutes to do 
50 questions                                  6.9       5.6            3.6 
Number of questions per unit 
of recall time                                1.00      1.23           1.92 
Number of sets of 25 questions 
to get equal reliability of .618              2.00      2.60           3.14 
Reliability of Form A with 
Form B when 6.9 minutes of 
examination time are used                     .618      .607           .664 

¹H. A. Toops, “Trade Tests in Education,” Teachers College Contributions to Education, 
No. 115 (New York: Columbia University, 1921), especially pages 39-62. 

The following conclusions seem justified from Toops’s 
data: 

1. In order of decreasing reliability, the tests stand in the 
order: recall, recognition (five-response), and true-false, 
when fifty-item tests are compared. 

2. The average working times needed were: recall, 6.9 
minutes; recognition, 5.6 minutes, and true-false 3.6 minutes. 

3. In the time needed for 100 recall items, 123 recognition 
items can be answered, and 192 true-false items can be 
responded to. 

4. When reliabilities are estimated for equal working 
times (6.9 min.), the orders of rank are: true-false (.664); 
recall (.618); and recognition (.607). Equal working times 
afford, it seems, the best basis for such comparisons. 
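
The arithmetic behind the last two rows of Table 32 can be retraced with the Spearman-Brown (Brown's) formula and its inverse. The sketch below is an assumption about how such figures are obtained, not a record of Toops's own computations; the function names are illustrative, and small discrepancies from the printed values are to be expected because the published coefficients are rounded.

```python
def spearman_brown(r, n):
    """Reliability of a test lengthened n times, from the reliability r of one unit."""
    return n * r / (1 + (n - 1) * r)


def lengthening_needed(r, target):
    """Number of units of reliability r needed to reach the target reliability."""
    return target * (1 - r) / (r * (1 - target))


# Reliabilities of the 25-question halves, and questions per unit of recall time.
halves = {"recall": (0.448, 1.00), "recognition": (0.385, 1.23), "true-false": (0.340, 1.92)}

for name, (r_half, speed) in halves.items():
    sets_needed = lengthening_needed(r_half, 0.618)
    # In 6.9 minutes a pupil answers 2 * speed sets of 25 questions.
    equal_time = spearman_brown(r_half, 2 * speed)
    print(f"{name:11s} sets of 25 for .618: {sets_needed:.2f}   "
          f"reliability in 6.9 min: {equal_time:.3f}")
```

The second column of output is the basis of conclusion 4: the true-false test gains most from equalizing the working time because nearly twice as many items can be answered in the same period.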

The work of Toops on college students has led directly 
or indirectly to the studies of Ruch and Stoddard, Wood, 
Paterson, DeGraff and Ruch, Brinkley, Charles, and 
others. These studies will next receive attention. 

The investigation of Ruch and Stoddard.¹ These authors 
selected a set of 100 information items covering the general 
field of history and the social sciences, suitable in difficulty 
for twelfth-grade pupils. These items were next broken by 
chance into two approximately equal “forms,” designated 
as Form A and Form B. The items were then adapted to 
each of the following types: 

1. Recall 
2. Recognition, 5-response 
3. Recognition, 3-response 
4. Recognition, 2-response 
5. True-false 

Thus, each of the 100 items appeared in five different 
type-forms. Two items (Form A only) are given on the 
next page in all five types. 

¹Quoted with changes and deletions from G. M. Ruch, The Improvement of the Written 
Examination (Chicago: Scott, Foresman and Co., 1924), pp. 107-114. For a fuller 
report see: G. M. Ruch and G. D. Stoddard, “The Comparative Reliabilities of Five 
Types of Objective Examinations,” Journal of Educational Psychology, Vol. XVI (1925), 
pp. 89-103. 

I. Recall, Form A 

1. The American Revolution began in the year ........ 
50. Passports are issued by the Department of ........ 

II. Five-Response, Form A 

1. The American Revolution began in (1) 1762 (2) 1775 
(3) 1783 (4) 1789 (5) 1812 
50. Passports are issued by the Department of (1) State 
(2) Commerce (3) Interior (4) War (5) Labor 

III. Three-Response, Form A 

1. The American Revolution began in (1) 1762 (2) 1775 
(3) 1789 
50. Passports are issued by the Department of (1) State 
(2) Commerce (3) Interior 

IV. Two-Response, Form A 

1. The American Revolution began in (1) 1762 (2) 1775 
50. Passports are issued by the Department of (1) State 
(2) Commerce 

V. True-False, Form A 

1. The American Revolution began in 1775. True False 
50. Passports are issued by the Department of Commerce. True False 

“In order to keep practice effects at as nearly a minimum 
as possible, it seemed inadvisable to attempt to have each 
pupil take the two forms in all five ways. For this reason 
all pupils were given the recall type, Form A, followed 
directly by Form B, and then one day later were given the same 
items in one other type-form. To administer the tests in 
this way, the total group of twelfth-grade pupils was broken 
into four sub-groups, designated as groups A, B, C, and D 
by a strictly alphabetical division. The senior classes of 
about fifteen Iowa high schools were arranged in alphabetical 
order, keeping the schools separate. The first one-fourth in 
the alphabet, all schools combined, were called Group A, 
the second one-fourth, Group B, and so on for Groups 
C and D. 

“It will thus be seen that the groups were random 
samplings with every school contributing equal numbers to each 
group. Since more than five hundred pupils were involved, 
the sub-groups can be accepted as equal in ability for all 
practical purposes. The sub-groups numbered about one 
hundred thirty-five pupils. The following tabulation will 
make these points clearer. 


Group    No.    Day 1             Day 2 
A        137    Recall A and B    5-Response A and B 
B        134    Recall A and B    3-Response A and B 
C        135    Recall A and B    2-Response A and B 
D        133    Recall A and B    True-false A and B 


“The recall type was given first for two reasons: first, 
because it is least suggestive of the correct answers and 
hence produces smaller practice effects on later tests; and 
second, in order that all four groups might take one test in 
common as a check on the equivalence of abilities of the 
groups. 

“The reliability coefficients are given in Table 33. 

TABLE 33 

Reliability Coefficients of the Five Types of Examination 

(a)            (b)                      (c)                        (d) 
Type           Form A vs. Form B        Reliability of 100         N 
               (i.e., 50 items          Items (by the Spearman- 
               vs. 50 items)            Brown formula) 
Recall         .81 ± .010               .90                        562 
5-response     .80 ± .021               .89                        137 
3-response     .60 ± .037               .75                        134 
2-response     .74 ± .027               .85                        135 
True-false     .56 ± .040               .71                        133 

“The figures in column (c) were calculated by means of 
the Spearman-Brown formula by using n = 2, as an estimate 
of the reliability of 100 items. 
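
The step from column (b) to column (c) of Table 33 can be sketched as follows, assuming the standard Spearman-Brown form $r_n = nr / [1 + (n-1)r]$ with n = 2. The dictionary simply restates column (b); the names are illustrative.

```python
def spearman_brown(r, n=2):
    """Estimated reliability when a test of reliability r is made n times as long."""
    return n * r / (1 + (n - 1) * r)


# Column (b) of Table 33: correlation of 50 items (Form A) with 50 items (Form B).
form_a_vs_form_b = {
    "recall": 0.81,
    "5-response": 0.80,
    "3-response": 0.60,
    "2-response": 0.74,
    "true-false": 0.56,
}

for test_type, r in form_a_vs_form_b.items():
    print(f"{test_type:11s} r(50 items) = {r:.2f}   "
          f"estimated r(100 items) = {spearman_brown(r):.2f}")
```

Worked from the two-decimal coefficients, the true-false estimate comes out a point higher than the printed .71; this is presumably a rounding matter, the published figure having been computed from the unrounded correlation.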

“ . . . The times needed to complete one hundred items 
were kept to the nearest half-minute, thus making it possible 
to determine: (1) the relative rapidity of administration of 
each type, and (2) the reliability per unit of working time. 
Table 34 gives such calculations. 

“The figures in column (b) of Table 34 are the average 
times needed by the groups of 135 pupils to complete (i.e., 
attempt) 100 items. The numbers in column (c) are the 
ratios of the time of the recall test to the times of each of the 
recognition tests, thus 18.7/16.0 = 1.17, 18.7/13.5 = 1.39, etc. 
This means that 117 five-response items can be given in 
the same length of time needed for 100 recall items, 18.7 
minutes; 139 three-response items can be given in the time 
needed for 100 recall items, etc. The values in column (d) 
are brought forward from column (c) of Table 33. 

TABLE 34 

Relative Times Needed for Each Type of Examination and Relative 
Reliabilities per Unit of Working Time 

(a)            (b)              (c)               (d)            (e) 
Type           Time in          Items that        Reliability    Reliability 
               Minutes to       Can Be Given      of 100         per 18.7 
               Complete 100     in 18.7 Minutes   Items          Minutes 
               Test Items       (Recall as Base)                 Working Time 
Recall         18.7             100               .90            .90 
5-response     16.0             117               .89            .90 
3-response     13.5             139               .75            .81 
2-response     11.4             164               .85            .90 
True-false     10.2             183               .71            .82 



“The question at once presents itself whether equal 
working times would result in equal reliabilities of the several 
types. If, for example, 183 true-false items can be answered 
in 18.7 minutes (the time needed for 100 recall items), what 
are the comparative reliabilities of 183 true-false items and 
100 recall items? We have already seen the usefulness of 
the Spearman-Brown formula for arriving at such 
comparisons. Taking the coefficients of column (d) of Table 34 and 
using n equal to 1.17, 1.39, 1.64, and 1.83, in turn, we obtain 
the values in column (e). These might be read as follows: 
‘the reliability of type .... for 18.7 minutes working time.’ 
It will be seen that the original differences are cut down to 
such an extent that at least two of the types prove to be as 
satisfactory as the recall under equal time limits. These 
are the 5-response and 2-response forms. The true-false 
and 3-response seem somewhat inferior.” 
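
The computation described in the quoted passage can be retraced in a few lines. This is a sketch under stated assumptions: the working times and 100-item reliabilities are those reported for Table 34, the ratio n is rounded to two decimals as in the text, and the Spearman-Brown formula is applied with a fractional n. Function and variable names are illustrative.

```python
def spearman_brown(r, n):
    """Estimated reliability when a test of reliability r is made n times as long."""
    return n * r / (1 + (n - 1) * r)


RECALL_MINUTES = 18.7

# type: (minutes to complete 100 items, reliability of 100 items)
table_34 = {
    "recall": (18.7, 0.90),
    "5-response": (16.0, 0.89),
    "3-response": (13.5, 0.75),
    "2-response": (11.4, 0.85),
    "true-false": (10.2, 0.71),
}


def equal_time_reliability(minutes, r100, base_minutes=RECALL_MINUTES):
    """Return (n, reliability) for a test filling base_minutes of working time."""
    n = round(base_minutes / minutes, 2)  # column (c), as a ratio to 100 items
    return n, spearman_brown(r100, n)


for test_type, (minutes, r100) in table_34.items():
    n, r = equal_time_reliability(minutes, r100)
    print(f"{test_type:11s} items in 18.7 min = {100 * n:.0f}   "
          f"reliability = {r:.2f}")
```

Using a fractional n treats the faster test as if it were lengthened with items of the same quality, which is the assumption on which the Spearman-Brown formula rests.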

The following conclusions seem to follow from the data 
gathered by Ruch and Stoddard: 

1. For a constant number of items, the five tests rank as 
follows in order of decreasing reliability: recall, five-response, 
two-response, three-response, and true-false. 

2. When tests of equal working times are compared, the 
differences are small but favor the recall, five-response, and 
two-response. 

3. There is no apparent reason why the two-response 
test proved more reliable than the three-response in this 
study, the a priori expectancy being to the contrary. 

4. There is a reasonably close agreement between this 
study and that of Toops previously reported. 

Wood’s investigations. Wood has made numerous 
investigations of reliabilities of old- and new-type 
examinations. The following data are selected, as typical, from two 
of his more recent and extensive studies. Table 35 shows 
selected reliability coefficients from Wood’s findings in 
these two studies. 

Wood’s reliability coefficients are somewhat higher than 
most of those previously reported, for at least two reasons: 
(a) his tests were longer; and (b) he used larger numbers of 
cases and more heterogeneous groups in most cases, especially 
in the French and Spanish examinations reported last in 
Table 35.¹ 

TABLE 35 

Selected Reliability Coefficients From Wood’s Studies of 
Objective Test Types 

                                                                     Reliability 
                          No. of   No. of   Academic                 Coefficients 
Subject                   Items    Cases    Level      Type          Rights   R-W 
French*                   50       ....     College    T-F           .83†     .80† 
Law (Pleading)            90       74       College    T-F           .83      .77† 
Law (Property)            100      ....     College    T-F           .75      .76 
Law (Equity)              70       100      College    T-F           .66†     .65† 
French, Part I            100      2000     J.H.S.     5-R           .94** 
French, Part II           60       2000     J.H.S.     5-R           .95** 
French, Part III          60       2000     J.H.S.     Completion    .96** 
French, Parts I to III    220      2000     J.H.S.     5-R and Com.  .97** 
Spanish, Part I           100      2000     J.H.S.     5-R           .93** 
Spanish, Part II          60       2000     J.H.S.     5-R           .92** 
Spanish, Part III         65       2000     J.H.S.     Completion    .94** 
Spanish, Parts I to III   225      2000     J.H.S.     5-R and Com.  .97** 

*B. D. Wood, “Studies of Achievement Tests,” Journal of Educational Psychology, Vol. 
XVII (1926), pp. 1-22; especially pp. 6-7. 
†Averages of four coefficients. 
**B. D. Wood, New York Experiments with New-Type Modern Foreign Language Tests 
(N. Y.: The Macmillan Co., 1927), p. 40. Quoted by permission of the Macmillan Co. 

Wood’s studies do not show direct comparisons of various
types of items based on the same subject-matter, except
roughly so in the French and Spanish tests administered to
2000 New York junior high-school pupils, where 60- or
65-item completion tests may be compared with 60- or
100-item multiple-choice tests (five-response). The differences
are slightly in favor of the completion test, but not
significantly so. This finding is in substantial agreement
with the results of Toops and of Ruch and Stoddard. The
fact that tests of from 50 to 100 items yield reliabilities of
from .66 to .96 is reassuring.

¹The size of the coefficient of correlation depends, in part, upon the range of the values
correlated. Kelley calls this phenomenon “range of talent.” Another name is “heterogeneity.”
(See Chapter XV.) Since the 2000 pupils taking the French and Spanish examinations
were junior high-school students, it is reasonable to suppose that they represented
a greater range of individual differences than was true of the groups taking the
first-mentioned French and the Law examinations, the latter being college examinations.
Between junior high-school and college there is a great deal of selective elimination, as is
well known. This would tend to make the reliability coefficients higher in the case of the
younger pupils.

It should be noted that the Toops study dealt with college groups, the Ruch-Stoddard
study with high-school seniors, and the two studies of Wood with junior high-school and
college students. The three studies show somewhat different results, but many of these
differences are to be explained by the nature of the groups tested (range of talent or
heterogeneity); although some consideration must be given to the different numbers of cases, the
different school subjects concerned, and to the types of tests employed.

The investigation of DeGraff and Ruch.¹ Using, with
modifications, the procedure of Toops and Ruch-Stoddard, 
these authors devised 200 simple recall items covering the 
general field of American history. These 200 items were 
then broken by chance into two forms of 100 items each, 
designated as Form A and Form B. Each form was then 
“translated” into items of the seven-response, five-response, 
three-response, two-response, and true-false types, respec- 
tively (in the same manner as was described for the experi- 
ment of Ruch and Stoddard). 

A total of 2453 pupils took Recall, Form A, the first day 
of the experiment, followed by Recall, Form B, the second 
day. On the third and last day the total group was divided 
into ten sub-groups, by chance, as follows: 

 1. 7-Response “Do Not Guess”
 2. 7-Response “Guess”
 3. 5-Response “Do Not Guess”
 4. 5-Response “Guess”
 5. 3-Response “Do Not Guess”
 6. 3-Response “Guess”
 7. 2-Response “Do Not Guess”
 8. 2-Response “Guess”
 9. True-false “Do Not Guess”
10. True-false “Guess”

The “Guess” and “Do-Not-Guess” sub-groups (of each
pair) received the same test items. One group was instructed
emphatically to guess at all items when in doubt
and to leave no items blank. The other group was directed
to omit all doubtful items and under no circumstances to
guess. Table 36 shows the obtained reliability coefficients.

¹Reported in full in G. M. Ruch et al., Objective Examination Methods in the Social
Studies (Chicago: Scott, Foresman and Company, 1926), pp. 54-88. For a brief account
see the Journal of Educational Psychology, Vol. XVII (1926), pp. 368-375.

TABLE 36

Reliability Coefficients for Six Types of Objective Tests Covering
the Same Information (DeGraff and Ruch)

                 “Guess” Instructions        “Do Not Guess” Instructions
                 Rights   Rights minus       Rights   Rights minus
Test                      Wrongs                      Wrongs
Recall (.950)
7-Response                .839               .886     .907
5-Response       .864     .902               .862     .882
3-Response       .837     .858               .886     .890
2-Response       .745     .864               .859     .843
True-false       .641     .780               .885     .837


The average working times were very different for the 
different types of tests. Table 37 shows the facts. 


TABLE 37

Average Time in Minutes to Answer 200 Items of the Various
Types as Found by DeGraff and Ruch

Type                            Time    No. of Cases
7-response (g)* (200 items)     45.9    212
7-response (n)† (200 items)     40.8    206
5-response (g) (200 items)      42.2    214
5-response (n) (200 items)      37.6    262
3-response (g) (200 items)      38.2    239
3-response (n) (200 items)      37.6    219
2-response (g) (200 items)      34.6    207
2-response (n) (200 items)      34.5    251
True-false (g) (200 items)      33.0    227
True-false (n) (200 items)      30.5    244

Recall A (100 items)            25.2
Recall B (100 items)            24.8    Cases: 2200

*(g) indicates the tests taken under instructions to guess.
†(n) indicates the tests taken under instructions not to guess.
Table 38 shows the reliability coefficients brought to the
common base of the time required to answer 100 recall items 
by means of the Spearman-Brown prophecy formula. The 
original or actual coefficients (those of Table 36) are also 
shown for the sake of comparison. The method of estimating 
reliability coefficients upon a basis of equal working times 
has already been described in connection with the abstract 
of the study by Ruch and Stoddard. 
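The estimate for equal working times can be sketched in a few lines of Python. This is an illustration only, not the authors' own computation. The function names are hypothetical, and the sketch assumes that each coefficient of Table 36 refers to a 100-item form, that such a form takes half the 200-item time of Table 37, and that the recall base is about 25.0 minutes (the mean of 25.2 and 24.8).

```python
def spearman_brown(r: float, k: float) -> float:
    """Spearman-Brown prophecy formula: reliability of a test made k times as long."""
    if k <= 0:
        raise ValueError("length ratio k must be positive")
    return k * r / (1 + (k - 1) * r)


def reliability_for_equal_time(r: float, test_minutes: float, base_minutes: float) -> float:
    """Estimate the reliability of a test lengthened (or shortened) to fill base_minutes."""
    k = base_minutes / test_minutes  # how many times longer the test could be
    return spearman_brown(r, k)


# True-false, "Guess", rights only: r = .641 (Table 36).
# Assumed time for one 100-item form: half of 33.0 minutes (Table 37).
# Assumed recall base: 25.0 minutes for 100 recall items.
estimate = reliability_for_equal_time(0.641, 33.0 / 2, 25.0)
print(round(estimate, 3))
```

Under these assumptions the coefficient of .641 rises to about .73, which is close to the .729 printed for the same test in Table 38.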

TABLE 38

Reliability Coefficients for the DeGraff-Ruch Investigation as
Estimated for Equal Working Times by Means of the Spearman-
Brown Formula (The basis is that of the time needed by the average
pupil for answering 100 recall items.)

                   Uncorrected for Chance         Corrected for Chance
                   Estimated by                   Estimated by
                   Spearman-                      Spearman-
Type of Test       Brown Formula   Original       Brown Formula   Original
7-response (g)     .815                           .851            .839
7-response (n)     .901            .886           .920            .907
5-response (g)     .884            .864           .917            .902
5-response (n)     .893            .862                           .882
3-response (g)     .871            .837                           .858
3-response (n)     .913            .886                           .890
2-response (g)     .805            .745                           .864
2-response (n)     .955            .859           .905            .843
True-false (g)     .729            .641           .842            .780
True-false (n)     .925            .884           .892            .837

Correlation of Recall A vs. Recall B                  .950
Coefficient of Reliability (Recall A plus Recall B)   .970

The following conclusions seem to be justified by the
results of the DeGraff-Ruch experiments:

1. The recall is the most reliable test of the six types
studied when an equal number of items are compared.


















2. Instructions against guessing when in doubt raise the
reliabilities, especially when the scores are corrected for
chance by the formula $\text{Score} = R - \frac{W}{n-1}$.

3. The true-false type proved least reliable, the recognition
types falling in intermediate positions, and the recall best.

4. The average working times were quite unequal, the
recall requiring more time than the true-false in about the
ratio 50:32, or about 3:2.

5. When tests of equal working times are compared through
estimates of reliability made possible by the Spearman-
Brown formula, none of the recognition types (including the
true-false) quite reached the reliability of the recall, although
the approach was close enough in most cases to
justify the conclusion that for equal working times recall and
recognition types are not greatly dissimilar in reliability
(especially when the instructions are against guessing and
scores are corrected for chance).
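The correction for chance named in the second conclusion may be illustrated by a short Python sketch. The function name and the sample scores are hypothetical. R is taken to be the number of right answers, W the number of wrong answers (omissions not counted), and n the number of responses offered per item.

```python
def corrected_score(rights: int, wrongs: int, n_responses: int) -> float:
    """Score corrected for chance: R - W / (n - 1)."""
    if n_responses < 2:
        raise ValueError("an item must offer at least two responses")
    return rights - wrongs / (n_responses - 1)


# Hypothetical pupil: 60 right, 30 wrong, 10 omitted on a true-false test (n = 2).
print(corrected_score(60, 30, 2))   # 30.0, i.e. "rights minus wrongs"

# Hypothetical pupil: 60 right, 20 wrong on a five-response test (n = 5).
print(corrected_score(60, 20, 5))   # 55.0
```

For true-false items the formula reduces to rights minus wrongs. The penalty for each wrong answer grows smaller as the number of responses increases.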

Charles’s Study. The study by Charles¹ gives additional
data on five types of objective tests as applied to students 
in college classes in general psychology. Table 39 shows his 
results. 

TABLE 39

Reliabilities Found by Charles for Five Types of Objective Tests
in an Examination in Elementary Psychology

                 No. of   No. of   Score       Score “Rights
Type of Test     Items    Cases    “Rights”    Minus Wrongs”
Recall           50       747      .60
5-Response       50       182      .68         .67
3-Response       50       188      .62         .62
2-Response       50       188      .48         .53
True-false       50       187      .60         .55

Average of the four recognition tests   .595   .592

The probable errors range between .016 and .038.

¹Op. cit.
Charles’s reliability coefficients are comparatively low,
although it must be remembered that he dealt with short
tests (50 items) and with selected college groups. His
values are in good harmony with those of Toops, who dealt
with roughly similar groups. Charles’s results are somewhat
out of line with previously reviewed studies in that the recall
tests were not noticeably better than the recognition
tests (except two-response), and in one case (five-response)
the recall tests proved less reliable. Charles’s recall tests,
on the other hand, were not as highly objective as were
those of Toops, Ruch-Stoddard, and DeGraff-Ruch.

Rutledge’s investigations. Rutledge¹ made a number of
studies on tests given to college classes in psychology. His
results are considerably different from those reported elsewhere
in this volume, as Rutledge himself points out. Tables
40 and 41 present selected findings from the investigations
carried out by Rutledge.

Rutledge comments on his results as follows: 

This correction for length of time allowed for the test does not change
the relative order of the reliabilities of the tests. Since these relative
reliabilities are so widely divergent from those found by other investigators,²
an analysis of the number of attempts on each type of question may
suggest the causes of the differences.

Using the averages from (our) Table 40, Rutledge cal- 
culated the values to be expected by tests of 120 items. 
(See Table 41.) 

Rutledge concludes: 

All of the expected reliabilities are high for thirty minutes of testing 
and indicate that any one of the three types of question may be used in 
constructing reliable tests, although the true-false is not as good as mul- 
tiple choice or completion. 

¹R. E. Rutledge, “The True-False Examination in Elementary Psychology with Suggestions
for its Improvement,” Ph. D. Thesis, University of California. Unpublished.

²Toops found recognition and true-false types of examinations to have equal reliabilities
for equal times of testing. Ruch and Stoddard found three-response and true-false tests
to be approximately equal in reliability. Brinkley found the order of reliability in American
history to be completion, true-false, and multiple choice.
TABLE 40

Expected Reliabilities of 12-Minute 40-Item True-False, Multiple-
Choice, and Completion Tests (Table VII from Rutledge)

Computed From            r                        Expected
Material          T-F    Comp.   M.C.      T-F    Comp.   M.C.
1st 40 items      .49    .70     .79       .66    .82     .88
2d 40 items       .62    .66     .79       .76    .79     .88
3d 40 items       .52    .68     .85       .69    .81     .92
Average           .54    .68     .81       .70    .81     .89


TABLE 41

Expected Reliabilities of 120 Statements of Each Type of Question
(Table IX from Rutledge. Based on average r’s)

                                Expected Reliability
120 True-False Statements       .87
120 Completion                  .94
120 Multiple Choice             .96
Average                         .92
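The step from 40-item to 120-item tests can be sketched with the Spearman-Brown formula, on the assumption that Rutledge simply tripled the test length. The function name is hypothetical. The inputs are the rounded 40-item averages of Table 40, so small departures from the printed figures of Table 41 are to be expected.

```python
def spearman_brown(r: float, k: float) -> float:
    """Reliability expected when a test of reliability r is made k times as long."""
    return k * r / (1 + (k - 1) * r)


# Average expected reliabilities of the 40-item tests (Table 40, rounded).
forty_item = {"true-false": 0.70, "completion": 0.81, "multiple choice": 0.89}

# Tripling the length: 40 items -> 120 items (assumed to be Rutledge's procedure).
one_twenty_item = {name: spearman_brown(r, 3) for name, r in forty_item.items()}

for name, r in one_twenty_item.items():
    print(f"{name}: {r:.3f}")
```

The sketch gives about .875, .927, and .960, against the printed .87, .94, and .96.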


The study of Crawford and Raynaldo.¹ These writers
found that out of twenty comparisons fifteen favored the
old-type test. It is doubtful whether much significance is
to be attached to these results, as indeed the authors point
out, since the makers of the true-false tests were little skilled
in such work. Such tests were also new to the students.
Moreover, the numbers of cases were very small in all but
a few comparisons, ranging from nine to seventy-four; in
ten classes there were fewer than twenty students, and there
were only six classes with more than thirty students.

¹C. C. Crawford and D. A. Raynaldo, “Some Experimental Comparisons of True-False
Tests and Traditional Examinations,” School Review, Vol. XXXIII (1925), pp. 698-706.

Reliability of matching tests. Ruch, Murdock, and
Maupin prepared 120 paired items whereby 120 men were
to be matched with the same number of characterizing
phrases. The 120 items were then broken into Forms A
and B of 60 items each. Each of the two forms was next
prepared as five groupings, i. e., sets of 5, 10, 15, 20, and 30
pairs to be matched.¹

The purpose of this study was twofold: (a) To study the
change in reliability in going from pairs of 5, 10, 15, 20, to 30.
(b) To study the change in difficulty in the same series; the
increase in difficulty in larger groupings being principally
due to lessened opportunity for chance successes, although
at least one other factor also enters in a minor degree.

A parallel experiment, but dealing with the matching of 
dates and events was carried out at the same time. The 
reliability coefficients, and certain other facts, are given 
here, the discussion of relative difficulties being reserved for 
a later section of this chapter (page 316). 

The sample from Form A on page 305 will help to make 
the details somewhat clearer. The sample shows the first
thirty items (or half) of Form A in the grouping by fives. 
In the grouping by tens, the first two sets of five pairs were 
pooled; the same pooling being carried on successively for 
the fifteen, twenty, and thirty groups. 


TABLE 42

Comparative Reliabilities of Matching Tests of Varying Groupings
of Pairs

                 Dates-Events                    Men-Characterizations
             Grades        Grade 12          Grades        Grade 12
Grouping     r      N      r      N          r      N      r      N
5's          .86    129           130        .93    161    .94    148
10's         .74    124           121        .77    164    .93    151
15's         .84    121           124        .89    168    .92    146
20's         .79    127           125        .90    159    .98    145
30's         .76    124           124        .88    160    .95    146

¹For complete details, see G. M. Ruch et al., Objective Examination Methods in the Social
Studies, pp. 89-104.
Matching Test: Men and Characterizations

Directions: Read each characterizing phrase and then find the man at the left whom
the phrase fits best. Record the number of the proper man in the parenthesis in front of
each phrase. Notice the first item is already filled in correctly. Each phrase must be
matched with a man in the same section. Work as fast as you can without making mistakes.

FORM A

Men                          Characterizing Phrases

Section 1
 1. Thomas H. Benton         (5) Author of the Declaration of Independence
 2. Thaddeus Stevens         ( ) For thirty years a senator from Missouri
 3. George B. McClellan      ( ) An immigrant who worked for political reform
 4. Carl Schurz              ( ) Leader of Union Army in Peninsula Campaign
 5. Thomas Jefferson         ( ) Congressman demanding harsh treatment of South

Section 2
 6. Miles Standish           ( ) Discoverer of the New World for Spain
 7. De Witt Clinton          ( ) Spent a fortune to found a colony in America
 8. Charles Sumner           ( ) Military man of Plymouth, told of by Longfellow
 9. Sir Walter Raleigh       ( ) Massachusetts senator denouncing “Crime Against Kansas”
10. Christopher Columbus     ( ) Governor of New York - promoted the Erie Canal

Section 3
11. Vasco de Balboa          ( ) Governor of Plymouth Colony and Pilgrim leader
12. David Wilmot             ( ) Discovered the South Sea, or Pacific Ocean
13. Woodrow Wilson           ( ) Offered a “Proviso” concerning slave territory
14. William Bradford         ( ) Portuguese explorer who rounded Africa to India
15. Vasco da Gama            ( ) The “Great War President” - League of Nations

Section 4
16. William T. Sherman       ( ) U. S. agent to France during X-Y-Z affair
17. Cyrus W. Fields          ( ) Invented cylindrical newspaper printing press
18. John Winthrop            ( ) Northern general who marched through Georgia
19. Richard Hoe              ( ) Laid the first successful Atlantic cable
20. Charles C. Pinckney      ( ) Puritan Governor of Massachusetts Bay Colony

Section 5
21. James J. Hill            ( ) Fifth President - “Era of Good Feeling”
22. William McKinley         ( ) Orator who denounced “Writs of Assistance”
23. James Monroe             ( ) Railroad “King” - builder of Northwest
24. Henry Clay               ( ) Third “Martyr President” - shot by anarchist
25. James Otis               ( ) Kentucky statesman famous for compromises

Section 6
26. John Jacob Astor         ( ) A South Carolina advocate of nullification
27. Andrew Jackson           ( ) The “Little Giant” debating with Lincoln
28. John C. Calhoun          ( ) Founded a fur-trading company in Oregon
29. George H. Meade          ( ) Hero of New Orleans, first president from West
30. Stephen A. Douglas       ( ) Northern general who won at Gettysburg

Go right on to Page 2 [Items 31-60 appeared on Page 2]
No time limits were set, but each form occupied less than
a class period. Unfortunately the data on working times 
have been lost. 

Matching tests seem to be highly reliable, especially the 
matching of men with characterizing phrases. The lower 
reliability of the date-events matchings is largely to be 
explained by the fact that dates are difficult of memory, 
and the schools are tending (rightly) to minimize such 
learning. 

It cannot well be said that large numbers of pairs to be
matched have any great advantage so far as reliability is
concerned. We shall see in a later section that the larger 
groupings do reduce the average scores and hence probably 
eliminate much guessing. Against this gain there is to be 
set the greater amount of time needed and the danger of 
mistakes of carelessness. It may well be that these factors 
operate to prevent the larger groupings from showing greater 
reliability, as is the a priori expectancy. 

COMPARATIVE WORKING TIMES

Introduction. It would be very useful for the test-maker
to know more or less accurately the number of items of the
different types which can be given in a stated period of time.
Such data could not be used slavishly, since the time
requirements probably vary greatly with such facts as: (a)
the maturity of the pupils, (b) mental abilities of the pupils,
(c) the difficulty of the test, (d) the school subject, (e) the
nature of the test (reasoning or reproduction of facts), etc.

Five previously cited studies have dealt with this question 
in some detail, viz., the investigations of Toops, Ruch and 
Stoddard, Brinkley, DeGraff and Ruch, and Charles. These 
five studies represent very different experimental conditions: 
grade ranges from seventh grade to college; numbers from 
71 to 2200; subjects as varied as general information, 
history, and psychology; and factual vs. reasoning tests. 





The author has previously published experimental data 
on the actual working times used by pupils in answering 
tests of different lengths and of differing types of items.¹
These findings were used as a basis of tentative recommenda- 
tions for proper time allowances. These have been objected 
to by certain writers. For this reason it has been thought 
desirable to bring together in one table the available evidence. 

Table 43 presents a concise summary of most of the
evidence which is easily accessible. The entries for average
time and average score are all experimental findings, although
it is not to be supposed that each series of tests
contained exactly 100 items. It has been necessary to bring
them to a common base (100 items). For example, if a
given investigator used a 50-item test, his averages and
average times were multiplied by two. This assumes
merely (a) that the pupils would continue to answer at the
same rate, and (b) items equally difficult and similar in other
respects. These assumptions are not particularly dangerous.
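The reduction to a common base of 100 items amounts to a simple proportion. A minimal Python sketch follows. The function name is hypothetical, and the working time of 12.5 minutes is an invented figure used only for illustration; the average score of 29.7 on 50 items is the recall average reported by Toops.

```python
def to_common_base(avg_score: float, avg_minutes: float,
                   n_items: int, base_items: int = 100) -> tuple[float, float]:
    """Scale an average score and an average time to a test of base_items items.

    Assumes, as the text does, that pupils keep answering at the same rate
    and that the added items are of equal difficulty.
    """
    factor = base_items / n_items
    return avg_score * factor, avg_minutes * factor


# 50-item test: average score 29.7, hypothetical average time 12.5 minutes.
score_100, minutes_100 = to_common_base(29.7, 12.5, 50)
print(score_100, minutes_100)
```

For a 50-item test the factor is two, as in the example given in the text.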

There are enormous differences in most or all of the results 
for any particular type of test. In general Brinkley's results
are out of harmony with the others, taken as a whole. 
The reason for this is unknown. It can hardly be due to 
sampling, even if his experimental group was small. The 
probable explanation may center about two facts: (a) the 
longer lengths of many of the statements in his test items, 
and (b) the use of many thought questions (the ratio of 
thought to information questions was about 1 to 2). Brink- 
ley’s questions were, on the average, at least twice the 
length of those of Toops, Ruch and Stoddard, DeGraff and 
Ruch, and Charles. This would require much more time 
for the sheer reading of the questions. 

Before attempting to draw up recommendations concerning
the rate at which objective tests may be answered, it
will be better to examine the more detailed findings of Ruch
and DeGraff as presented in Tables 44 and 45.*

¹The Improvement of the Written Examination, pp. 97; 113-114. Objective Examination
Methods in the Social Studies, pp. 80-84.

The reader is cautioned to keep in mind that the table of 
working times for recall tests (Table 44) is based upon 100 
items, but the times for the recognition types (Table 45)
are based upon 200 items. It is further to be remembered 
that the items used by DeGraff and Ruch were chiefly 
factual. 


TABLE 44

Percentiles of Time in Minutes for Recall Tests: 100 Items
(By Grades)

Per-          Form A (Grade)             Form B (Grade)          Form A   Form B
centiles   7th   8th   11th  12th     7th   8th   11th  12th     All      All
                                                                 Grades   Grades
100        44.5  43.0  44.0  44.0     45.0  47.5  40.5  40.5     44.5     45.5
90               33.4        36.1     30.3  34.3  32.9  34.0     34.0     33.2
80         27.4  30.6        32.7     26.4              31.2     30.7     30.2
70         25.5  28.3  29.2           24.5  28.4  28.1  29.7     28.0     27.9
60               26.5  27.3  28.4     22.5  26.7  25.5  27.2     26.6     25.7
50         22.0  25.1  25.0  26.7     21.1        24.4  25.5     24.8     24.2
40               23.6        25.5     20.0        23.0  24.2     23.2     22.5
30         19.3        22.0           18.7  21.1  21.6  22.6     21.0     21.0
20         17.5                       17.5  19.7  20.0  21.5     19.6     19.0
10         15.3                                   18.2  19.5     17.4     17.3


The following comments are based principally on Tables 
44 and 45, although Table 43 has been kept in mind in 
drawing conclusions. 

Recall Tests 

1. The slowest pupils answered recall items at a rate of 
more than two items per minute (40-45 minutes for 100 
items). 

2. Seventh- and eighth-grade pupils are not markedly 
slower than high-school pupils in answering the tests. 

*Objective Examination Methods in the Social Studies, p. 81.
3. The figures showed that 90 per cent of the pupils
finished the 100 recall questions in from thirty to thirty-five
minutes, or at a rate of about three per minute.

4. The average pupils (50th percentile) answered the recall
items at a rate of about four per minute (twenty-one to
twenty-seven minutes for the 100 items). The elementary-
school pupils finished somewhat sooner than the high-school
pupils, but this is to be explained by the fact that the former
did not attempt as many questions.

TABLE 45

Percentiles of Time in Minutes for Recognition Types: 200 Items

Per-                        Response Types                          True-False
centiles   7(g)  7(n)  5(g)  5(n)  3(g)  3(n)  2(g)  2(n)          (g)   (n)
100        60.0  60.0  60.5  60.0        60.5        60.0          55.0  56.0
90         55.4  50.7  52.6  48.1  48.4        45.0  43.6          44.2  39.1
80         50.4  48.2  48.6  44.1  45.1  45.2                      39.0  36.2
70         48.1  45.3  45.7  41.8        41.9  38.8  38.6          36.9  34.4
60         45.4  42.4  44.6  40.1  40.2  39.7  36.4  36.3          34.5  32.4
50         43.9  40.4  42.0  38.8  38.6  37.4  34.8  35.1          32.8  30.6
40         41.0  38.8  40.2  38.2  36.4  35.3  31.9  32.9                28.7
30         39.2  36.4  38.6  36.2  34.6  33.1  29.6  30.9
20         37.7  33.1  36.1  34.9  32.4  30.6  26.4  28.8          26.2  25.3
10         30.4  29.4  32.4  31.1  29.1  27.4  23.7  25.2          24.6  21.7

5. It should be noted that these figures are based upon the
times needed to go through the test (answering the questions
which the pupils knew) and not on the time needed to
answer all 100 questions. Moreover, these recall tests were
very difficult, as is attested by the fact that the average
recall score was about twenty-eight correct out of a possible
100. The strictly average pupil probably actually attempted
not more than half of the items, and very few scores above
90 were found even on the best papers.

6. If the recall test had been of more nearly “ideal”
difficulty (one whose average score is about 50 per cent of
the maximum), the pupils would have attempted more
items, but against this fact we can set the opposing one that
easier items would be answered more quickly.

7. For two reasons it is not ideal to base recommendations
on the times needed by the slowest pupils: (a) such times
represent wasted time to some degree, as inferior pupils
are prone to “putter around” almost indefinitely at such
tasks; and (b) a slight premium on speed is justifiable. In
making standard tests it is often the practice to set the time
limits so that 90 per cent can attempt all items within their
power. Using the data for ninetieth percentiles, we can say
that:

(a) In junior and senior high-school classes, three recall
items per minute is not an excessive requirement provided
the items are fairly short and of a factual character. For
reasoning tests, one or two items per minute should be a
reasonable number.

(b) It will be wise to increase these time allowances in
the lower grades.
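The rates quoted in the comments on recall tests follow from dividing the number of items by the minutes required. A minimal Python sketch is given below. The function name is hypothetical, and the three times are taken here to be the Form A, all-grades entries of Table 44 for the 100th, 90th, and 50th percentiles.

```python
def items_per_minute(n_items: int, minutes: float) -> float:
    """Working rate implied by finishing n_items in the given number of minutes."""
    return n_items / minutes


# Minutes needed for 100 recall items (assumed: Form A, all grades, Table 44).
recall_minutes = {"slowest pupil": 44.5, "90th percentile": 34.0, "50th percentile": 24.8}

rates = {label: items_per_minute(100, m) for label, m in recall_minutes.items()}
for label, rate in rates.items():
    print(f"{label}: {rate:.1f} items per minute")
```

The three rates come to a little over two, about three, and about four items per minute, in agreement with comments 1, 3, and 4.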


Recognition Tests 

1. The slowest pupils answered these at a rate faster than 
three per minute. 

2. Four to five multiple-choice or true-false items were 
handled per minute of working time, approximately 90 
per cent of the pupils having time to attempt all items within 
their power. 

3. The average pupil answered five or six multiple-choice 
(three to five responses) or true-false items per minute. 

4. Some time was saved by instructing pupils to omit
items rather than to guess when in doubt.

5. Using the values in the ninetieth percentile row of 
Table 45, it seems reasonable to conclude (for tests of the 
type employed by DeGraff and Ruch) that: 

(a) Four items per minute is a reasonable expectancy for 
upper-elementary and high-school pupils for multiple- 
choice tests (three to five responses) and true-false tests. 
(b) If thought questions are used, two or three items per
minute is a safer practice. (This “squares” fairly well with
Brinkley’s results as well.)

(c) Three items per minute would seem to be safe for 
pupils in grades four to six, although this recommendation 
is inferred from the working rates of upper-grade pupils. 

The foregoing statements are very conservative in the 
light of the evidence. Brinkley’s findings alone might be 
used as an argument against them. The Toops and Ruch- 
Stoddard studies indicate much higher rates of work.* 

A further comparison. It has been stated that because
of the very unequal working times needed for various types
of objective tests, comparison upon a basis of the number
of items which can be completed in a given length of time is
fairer than upon a basis of equal numbers of items. It has
already been found that 100 recall items can be given in a
class period of from forty to sixty minutes. The following
data, assembled from several sources and brought to a
common base, show something about the relative working
times needed for recall, five-response, and true-false tests.

The agreement is very close indeed when the wide differences
in experimental procedures are considered.


TABLE 46

Number of Items Which Can Be Given in the Time Needed for 100
Recall Items

Authority         Recall   5-Response   True-False
Toops             100      123          192
Ruch-Stoddard     100      117          183
DeGraff-Ruch      100      133
Brinkley*         100
Charles           100

*Brinkley’s “Word or Phrase Answer” was considered as simple recall.
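The entries of Table 46 can be understood as a proportion between working times. The Python sketch below is an illustration under stated assumptions and not the author's recorded procedure. The function name is hypothetical, the recall base is taken as 25.0 minutes for 100 items (the mean of the two recall forms in Table 37), and the five-response time is the 37.6 minutes reported in Table 37 for 200 items under instructions not to guess.

```python
def items_in_recall_time(minutes: float, n_items: int,
                         recall_minutes_per_100: float = 25.0) -> float:
    """Number of items of a given type answerable in the time needed for 100 recall items."""
    minutes_per_item = minutes / n_items
    return recall_minutes_per_100 / minutes_per_item


# Five-response, "Do Not Guess": 200 items in 37.6 minutes (Table 37).
five_response = items_in_recall_time(37.6, 200)
print(round(five_response))
```

Under these assumptions the result is about 133 items, which agrees with the DeGraff-Ruch entry for the five-response test.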


*The foregoing recommendations are cut almost in half from those originally published
in the author’s Improvement of the Written Examination (p. 97), those being based upon the
Toops and Ruch-Stoddard studies which were the only ones in print in 19^. As has been
mentioned, objections have been taken to the author’s earlier figures. The earlier figures
were experimental findings. Repetition of the experiments on a large scale by DeGraff
and the author has led to the more conservative recommendations of the present volume.
COMPARATIVE DIFFICULTIES

Introduction. Table 43 on pages 308-9 of this chapter pre- 
sented certain facts about the comparative difficulties of 
recall, recognition (multiple-response), and true-false tests. 
This section presents six studies in some detail. 

Toops’s results. Toops, as we have seen, gave three 
versions of the “same” items to college students. The 
following table is taken from Table I, p. 47, of “Trade Tests 
in Education,” each average being the result when a given 
type was taken first (Toops gave all three tests in each 
possible order), in order to eliminate practice effects. 


TABLE 47

Average Difficulties of Recall, 5-Response, and True-False Tests
as Found by Toops (The number of items is 50.)

Type           Average Score   No. of Cases
Recall         29.7            76
5-Response     33.4            26
True-false     35.8            22



Results of Ruch and Stoddard. Table 48 is a comparison
of five types of tests these authors studied.*


TABLE 48

Average Difficulties of Recall, Multiple-Response and True-
False Tests as Found by Ruch and Stoddard

               No. Items
Type           Per Form    Form A   Form B   No. of Cases
Recall         50          12.2     10.8     562
5-Response     50          27.2     22.8     137
3-Response     50          30.6     26.4     134
2-Response     50          35.6     32.0     135
True-false     50          30.1     27.7     133


Brinkley’s results. Brinkley (“New-Type Examinations
in the High School”) obtained the averages shown in Table
49. These results are abstracted from Brinkley’s Table
XI^, p. 98. The averages shown here are for tests of equal
working times (thirty-one minutes in each case).

*Improvement of the Written Examination, p. 112.

TABLE 49

Brinkley’s Findings on the Comparative Difficulties of Certain
Types of Objective Test Items

Type                                            Average
True-false                                      45.6
Multiple-response (3, 4, and 5 responses)       32.9
Completion                                      18.7
Word or phrase answer (a simple-recall test)    21.3
Arrangement                                     23.8



It is to be noted that Brinkley's results are not directly
comparable with those of Toops, Ruch and Stoddard, Charles
(to follow), and DeGraff and Ruch (to follow) because
unequal numbers of items and different items are compared.

Results of DeGraff and Ruch. Table 50 is taken from 
Objective Examination Methods in the Social Studies, p. 79, 
with changes. This investigation has already been described. 

TABLE 50

Comparative Difficulties of Objective Test Items as Reported by
DeGraff and Ruch (Averages)

                   Recog. A       Recog. A      Recog. B       Recog. B
                   (Uncorrected   (Corrected    (Uncorrected   (Corrected
Type of Test       for Chance)    for Chance)   for Chance)    for Chance)
7-response (g)*    50.0           41.5          39.6           32.6
7-response (n)†    44.9           40.0          37.2           33.1
5-response (g)     54.2           43.4          45.5           35.4
5-response (n)     48.8           42.3          42.1           36.4
3-response (g)     62.2           43.6          55.5           36.6
3-response (n)     54.1           41.9          48.2           36.1
2-response (g)     71.7           43.6          67.2           37.1
2-response (n)     65.1           45.8          60.3           40.2
True-false (g)     65.8           32.3          61.3           26.0
True-false (n)     51.0           30.8          47.6           26.8

*(g) indicates the tests taken under instructions to guess.
†(n) indicates the tests taken under instructions not to guess.
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Charles’s study. Table 51 summarizes the results of the 
unpublished study by Charles already cited. 

TABLE 51

Results Obtained by Charles on the “Same” Items in General
Psychology When Applied as Five Types of Objective Items

                         Averages
Type           Uncorrected      Corrected for      No. of Cases
               for Chance       Chance
               (“Rights”)       [R − W/(n−1)]
Recall            26.9                                 747
5-Response        58.9             48.4                182
3-Response        70.7             56.5                188
2-Response        77.5             55.5                188
True-false        67.4             38.1                189


Ruch, Murdock, and Maupin on matching tests. For
the rough outline of the procedure in this investigation, the
reader is referred to pages 303-5 of the present chapter.
Table 52 shows the average scores for the two matching tests.


TABLE 52

Relative Difficulties (Average Scores) for Matching Tests for
Varying Groupings (Form A, only; 60 pairs of items)

                    Dates-Events               Men-Characteristics
Grouping        Grades       Grade 12         Grades       Grade 12
               M      N      M      N        M      N      M      N
5's           9.7    129   16.8    [?]      [?]    161   44.7    148
10's          6.4    124   11.5    121      [?]    164   36.7    151
15's          6.1    121    [?]    124     22.1    168   32.3    146
20's          5.2    127    9.5    125     20.0    159   29.9    145
30's          5.0    124    9.9    124     16.3    160   26.2    146


There is a steady decrease in the average scores of both
tests as we move in the direction of the large groupings, thus
showing a decrease in the amount of the score due to chance.
The dates-events test was far too difficult, the
men-characteristics test being of about the proper difficulty.

On the whole the evidence is ambiguous, but logical 
considerations point toward the use of from ten to twenty 
pairs as being a fair compromise among the several factors 
involved. 



CHAPTER XII


CHANCE AND GUESSING IN RECOGNITION 
TESTS 

Two approaches to the problem of chance effects in test
scores. That most forms of recognition tests are open to
the effects of chance and guessing is evident. The extent to
which such effects are dangerous is as yet unsettled. There
is a considerable literature on chance effects in test scores,
especially in the case of the true-false test. These
discussions fall into two general groups:

(a) A priori or mathematical considerations of chance and
guessing from the standpoint of the mathematical theory of
probability.

(b) Experimental studies of the comparative reliabilities,
validities, difficulties, etc., of tests subject to and not subject
to guessing.

There is no intention here of decrying the merits of the 
published works of the first mentioned type, although it 
must be admitted that the author is biased in favor of 
accepting the results of actual experimentation whenever 
a priori and experimental results seem to be opposed. In 
spite of what has been said, it is nevertheless true that many 
writers on the chance element in true-false tests have mis- 
understood the implications of the theory of probability in 
discussing the subject. 

Certain writers have chosen to regard the situation of a
pupil in taking a true-false test as being analogous to such
chance situations as drawing by lottery from a container
holding black and white buttons (or other objects) or of
tossing coins. Under certain circumstances the answering
of true-false tests may be a matter of pure chance, but often
it is not. We should draw a distinction at this time between
guessing and pure chance (in the sense that a pupil has no
better basis of choice than something equivalent to tossing
a coin or drawing black and white balls from a box).

THE MATHEMATICS OF CHANCE APPLIED TO TESTS

Pure chance and guessing contrasted. As a pupil goes
through a true-false (or other multiple-choice) test and faces
a decision on each successive item, in turn, we can recognize
his responses as falling into several roughly distinguishable
categories:

(a) Items on which he is absolutely sure of the correct
response.

(b) Items on which he is not entirely certain, but which
he answers without serious doubt of the correctness of his
responses.

(c) Items on which he is in grave doubt. He has a
“feeling” (or often, merely a “hunch”) that a certain
response is correct.

(d) Items on which he is totally ignorant, and on which
any response, so far as he can tell, is a matter of pure chance.
In such a case the alternatives are (1) to guess or (2) to omit.

To these we must add a fifth category: 

(e) Items which are answered in good faith, but are wrong, 
not due to chance, but because the pupil is actually misin- 
formed. These are not guessed responses in any legitimate 
meaning of the word. 

It is logical to assume that only items of the (d) category 
obey the mathematical theorems of probability in their 
distributions of correct and incorrect responses. 

Weidemann, as previously quoted, has commented upon
the specific determiner as potent in “casting the die” at
times when the pupil is in real doubt. When some clue is
afforded by the wording of the item, that item falls into
category (c) rather than (d) as listed above.

It is permissible, perhaps, to anticipate later discussions
by pointing out that the instructions may be phrased so as
to encourage pupils to omit completely items of type (d)
rather than to guess, and hence much pure guessing may be
obviated.

Misunderstandings of the implications of the mathematical
theory of probability. One writer¹ has criticized the
two-response test in no uncertain terms as follows:

If one holds that by subtracting the wrong from the right answers one
eliminates the guessing factor, no more and no less, one must assume that
individuals in such tests always guess an even number of times; for if they
should guess an odd number of times, the effect of guessing could not be
wholly eliminated by this method. One must further assume that if an
individual happens to guess wrong the first time, his second guess must be
right, his third wrong, his fourth right, and so, guessing right and wrong
alternately; for if he should guess wrong or right twice in succession, the
method could not eliminate the guessing effect. That the law of chance
does not operate even approximately this way can easily be demonstrated.
(p. 236.)

But if one argues that the guessing factor and no more is eliminated by
subtracting the wrong answers from the right one must still further assume
that every wrong answer is a guess; otherwise, if the wrong answers are
subtracted, they will cancel not guesses but actual achievements. That
only guesses cause wrong answers no one would care to assume. (p. 238.)

This writer’s final conclusion is:

What then does the final score of such tests represent? No one knows.
That it cannot even approximately represent real ability or actual
achievement has been shown. Nothing important, therefore, should be done on
the basis of the score. (p. 239.)

(The reference here is to a table showing the results of 
drawing from a bag containing thirty-five white and thirty- 
five black buttons.) 

With Hahn’s second paragraph the author has no
particular quarrel if Hahn had in mind the responses labelled
(e) in our preceding discussion. However, only a portion of
wrong answers fall into category (e); many are undoubtedly
to be classified under category (d).

¹H. Hahn, “A Criticism of Tests Requiring Alternative Response,” Journal of
Educational Research, Vol. VI (1922), pp. 236-240.

McCall, in an early paper,¹ comments on the justice of
scoring of true-false tests as follows:

Let us consider first the reason for expressing a pupil's score as the
number correct minus the number wrong. Imagine a pupil who is
absolutely innocent of any knowledge of the physical features of the United
States. Were such a pupil to take the above test and were he to mark
every statement, he would according to the theory of chance mark ten
statements correctly and ten incorrectly. The chances of his guessing
right or wrong are fifty-fifty or one to one. His score on the above would
be

Score = 10 − 10 = 0.

In short, the pupil’s knowledge is zero, and the method of computing his
score gives him zero. Suppose instead that he knows ten statements and
guesses at the other ten. Of the ten guessed at, he would, according to
chance, get five correct and five wrong. That is, even though his real
knowledge is ten, he will show fifteen correct (10 + 5) and five incorrect.
The method of computing his score brings out his real knowledge.

Score = 15 − 5 = 10.

A pupil who marks every statement correctly makes a perfect score, viz.,

Score = 20 − 0 = 20.

Like Hahn, McCall does not seem to distinguish between
the most probable score and the one which is actually
obtained. The mathematics of probability does not assume
that every individual will be justly scored by the R−W
formula, but merely that such a correction yields the most
probable score, and that the average of a large number of
individual scores or the score of one individual on a very
large number of items should fall measurably close to the
value given by the R−W method. Here again, too much
faith is attached to the exactness with which probability
works. A little experimenting with coin tossing is all that
is needed to prove the justice of this criticism.
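
The coin-tossing experiment suggested here can be imitated with a short simulation. The sketch below is illustrative only: the function names, the fixed seed, and the choice of 20 items and 2,000 simulated pupils are assumptions made for the example and come from none of the studies cited.

```python
import random

def rights_minus_wrongs(n_items, rng):
    """Score one pupil who guesses blindly at every true-false item."""
    rights = sum(rng.random() < 0.5 for _ in range(n_items))
    wrongs = n_items - rights
    return rights - wrongs

def simulate(n_pupils=2000, n_items=20, seed=0):
    """Return the R-W scores of n_pupils pure guessers (assumed sizes)."""
    rng = random.Random(seed)
    return [rights_minus_wrongs(n_items, rng) for _ in range(n_pupils)]

scores = simulate()
mean_score = sum(scores) / len(scores)
print("mean:", round(mean_score, 2))
print("lowest:", min(scores), "highest:", max(scores))
```

Under these assumptions the average of many corrected scores lies close to zero, while individual pupils fall several points above or below it, which is the distinction between the most probable score and the score actually obtained.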

¹W. A. McCall, “A New Kind of School Examination,” Journal of Educational Research,
Vol. V (1920), pp. 33-46.

The replies of Barthelmess and Odell to Hahn’s
arguments. There have been several refutations of Hahn’s
arguments, notably those of Barthelmess¹ and Odell.²

Combining points raised by Barthelmess and Odell with
certain observations of the author, there are four
assumptions made by Hahn, viz.:

1. That an individual always guesses an even number of
times.

2. That if an individual happens to guess wrong the first
time, his second guess must be right, the third wrong, the
fourth right, etc.

3. That every wrong answer is assumed to be a guess.

4. That for every wrong answer (due to guessing) exactly 
the same number were right by guessing. 

The first two assumptions reduce to a misunderstanding 
of the laws of probability. The theorems summarizing 
chance phenomena imply nothing of the sort. The laws of 
probability imply merely that for large numbers of chances 
pure guessing will yield right and wrong responses in ap- 
proximately equal numbers. No order of alternation of 
such right and wrong guesses is implied, and no question 
of odd or even numbers of times enters. 

Assumption three has some bearing. Barthelmess says,
and rightly in the author’s opinion (loc. cit., p. 358): “It is
true that with the method of scoring used (R−W), one
assumes that every wrong answer is a guess. It will be
guessing unless (a) the pupil has learned the fact erroneously,
or (b) the wording of the question is suggestive.”

¹H. M. Barthelmess, “Reply to a Criticism of Tests Requiring Alternative Responses,”
Journal of Educational Research, Vol. VI (1922), pp. 357-359.

²C. W. Odell, “Another Criticism of Tests Requiring Alternative Responses,” ibid.,
Vol. VII (1923), pp. 326-330.

Barthelmess advances no proof of this statement, but the
findings of Weidemann emphasize the effects of certain
wordings, and Foster and Ruch, and others, have shown
pretty conclusively that some wrong answers are not
guesses. To the extent that genuine misinformation causes
wrong responses, the R−W formula penalizes or
over-corrects for chance. Odell goes even further and appears to
defend over-penalizing wrong responses (loc. cit., pp. 327-328):

It is a rather generally accepted maxim among educators that when we
know something we should know that we know it, and, furthermore, that
it is better to know that we do not know something than to think we know
it when we do not. Therefore, if the student in question thinks that he
knows the correct answers to all the exercises but really gives incorrect
answers to five of them (25 in all), a deduction should be made from his
positive score of twenty. In other words, according to this system of
scoring, a student who knows twenty and does not attempt the other five
will receive a higher score than a student who thinks he knows all of them
but is mistaken in some cases. This is as it should be.

The reader may not agree with Odell in entirety, but it
must be admitted that he has raised an issue that must be
settled. We should not, however, confuse Odell’s position
with that of Hahn who had a different situation in mind.

Barthelmess comes more directly to the point when he
says (loc. cit., pp. 357-358):

In the first place, the best test of any test is correlation with a criterion.
If this correlation is satisfactory, we can forget all minor criticisms
concerning chance.

In the second place, no one has a right to insist on perfect reliability.
The question at issue is, “Does this method secure more accurate results
than other available methods?”

Barthelmess’s position will receive considerable support 
from data to be presented later. 

Turning now to Hahn’s fourth assumption, viz., “That
just because a certain number of answers are [sic] wrong, the
same number of right answers must be guesses” (loc. cit.,
p. 238), we have again a misunderstanding of the laws of
probability. The implication of the R−W formula is not
that right and wrong guessed responses are exactly equal in
numbers but that equality of numbers is the most probable
outcome. There is really a considerable difference between
actual values and most probable values. Thus, if ten pennies
are tossed into the air and allowed to fall at will, the actual
number of heads (or tails) cannot be controlled or foretold,
although the most probable eventuality is five heads. The
situation would, theoretically, be the same in a true-false
test, if purely chance answering could be assumed.

It is quite evident that the R−W formula would not hold
exactly for every individual pupil if true and false responses
follow the same laws as tosses of pennies. The seriousness
of this inadequacy of the correction formula will be discussed
again as new experimental evidence is brought forward.

Other criticisms of alternate-response tests. West,¹
Asker,² Kohs,³ Kohs and Richards,⁴ Richards,⁵ Holzinger,⁶
and many others have commented at some length on the
scoring of two-response tests. Only the briefest mention of
these various points of view is possible here.

West gave a nonsense test to be marked “S” or “D”
indiscriminately. These test items then were scored by an
answer key belonging to another test. The scores thus
obtained were corrected by the R−W method. The corrected
scores on fifty items ranged from −18 to +20, the average
being about 1.03, and the resulting distribution almost
normal (that representing pure probability). The middle
half of the corrected scores fell between −4.2 and +6.3, a
range of about ten points. West then carried out a
somewhat similar experiment with a fifty-item synonym-antonym
test, made sufficiently difficult that a great many wrong
guesses (answers) resulted. West’s analysis of his results,
among other things, pointed to two conclusions: (a) more
items were guessed right than were guessed wrong, and (b)
women guessed right somewhat oftener than the men. West
concluded that this R−W method “is of very doubtful
reliability for group testing and especially so for the analysis
of individual ability” (p. 8).

¹P. V. West, “A Critical Study of the Right Minus Wrong Method,” Journal of
Educational Research, Vol. VIII (1923), pp. 1-9.

²W. Asker, “The Reliability of Tests Requiring Alternative Responses,” Journal of
Educational Research, Vol. IX (1924), pp. 234-240.

³S. C. Kohs, “High Test Scores Attained by Sub-Average Minds,” Psychological Bulletin,
Vol. XVII (1920), pp. 1-5.

⁴O. W. Richards and S. C. Kohs, “High Test Scores Attained by Sub-Average Minds,”
Journal of Educational Psychology, Vol. XVI (1925), pp. 8-17.

⁵O. W. Richards, “High Test Scores Attained by Sub-Average Minds, III,” Journal of
Experimental Psychology, Vol. VII (1924), pp. 148-156.

⁶K. Holzinger, “On Scoring Multiple Response Tests,” Journal of Educational
Psychology

Asker reports two experiments somewhat like those of
West. He used decks of cards to simulate the situations of
two- and three-response tests. Using twenty individuals, he
obtained a range of scores from −8 to 10 with an average
at zero. When these scores were expressed in per cents, and
70 or 75 was taken as the passing mark, “not one individual
was able to pass the test by guessing.” Asker next attacks
the problem through mathematical analysis by expanding
the binomial $(P+Q)^n$, where P is the expectancy of heads
by chance and Q is the expectancy of tails by chance. He
shows (Table I, p. 236) that the probabilities of obtaining
the following scores are as shown in Table 53.


TABLE 53

Asker’s Table of Probabilities for Two-Response Tests
(20 Items)

                   Corrected Scores
Right    Wrong     R−W    Per Cents    Probability: One in
 20        0        20       100           1,048,576
 19        1        18        90              52,429
 18        2        16        80               5,519
 17        3        14        70                 920
 16        4        12        60                 216
 15        5        10        50                  68
 14        6         8        40                  27
 13        7         6        30                  13
 12        8         4        20                   8
 11        9         2        10                   6
 10       10         0         0                   5.7

For three-response tests corrected by the formula R − W/2,
Asker gives the expectancies (Table II, p. 237) shown below.

TABLE 54

Asker’s Table of Probabilities for Three-Response Tests
(20 Items)

                   Corrected Scores
Right    Wrong    R − W/2   Per Cents    Probability: One in
 20        0        20        100          3,486,784,401
 19        1        18½        92½            87,169,610
 18        2        17         85              4,587,874
 17        3        15½        77½               382,323
 16        4        14         70                 44,978
 15        5        12½        62½                 7,028
 14        6        11         55                  1,406
 13        7         9½        47½                   351
 12        8         8         40                    108
 11        9         6½        32½                    40
 10       10         5         25                     18
  9       11         3½        17½                    10
  8       12         2         10                      6.7
  7       13          ½         2½                     5.5
  6       14        −1         −5                      5.5


Asker realized that guessing at all items in a test is not the 
usual situation, and he went on to set up further tables which 
assumed that certain proportions of the items were known 
certainly and the remainder only guessed at. If we examine 
Tables 53 and 54 with care, and make the further assumption 
that, in a true-false test of 100 or more items perhaps not 
more than one-third (if that many) are pure guesses, the 
possibility of obtaining any considerable fraction of the 
total score by chance is seen to be not very great. From 
Table 53 it is evident that if a pupil guesses at twenty items 
in a 100-item true-false test, there is but one chance in 216 
that he will earn more than ten points more than he deserves. 
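
The odds of this kind can be recomputed directly from the binomial law. The following is a minimal sketch, assuming that every guess is independent and that each of the n responses offered is equally likely to be chosen; the function name `one_in` and its parameters are introduced here for illustration and are not Asker's.

```python
from fractions import Fraction
from math import comb

def one_in(n_items, n_right, n_responses):
    """Odds ("one in ...") of exactly n_right correct answers when
    n_items are answered by blind guessing among n_responses choices."""
    p = Fraction(1, n_responses)        # chance of a lucky guess on one item
    prob = comb(n_items, n_right) * p**n_right * (1 - p)**(n_items - n_right)
    return float(1 / prob)

# Columns: rights, wrongs, odds for two-response, odds for three-response.
for right in (20, 18, 16, 14, 12, 10):
    print(right, 20 - right,
          round(one_in(20, right, 2), 1),
          round(one_in(20, right, 3), 1))
```

The printed odds may be compared line by line with the last columns of Tables 53 and 54; small differences in the final digit reflect rounding.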

In general, with true-false tests of sufficient length for
adequate reliability, and provided further that the pupil
guesses at no more than twenty per cent of the items, there
is little likelihood of errors greater than ten per cent of the
total (corrected) score arising often enough to be a serious
practical matter. Twenty per cent of guesses may seem a
small fraction, but this number (or less) may be obtained by
the combination of (a) instructions to omit items when in
doubt, and (b) tests of not too great difficulty. This assertion
also takes into account the fact that many so-called “guesses”
are not pure guesses but are responses made with a sufficient
“fringe of knowledge” to guarantee a high excess of rights
over wrongs. Only the pure guesses are to be reckoned with
in estimating the probabilities by mathematical formulas.

Kohs and Richards, in the three articles cited, set up
even more extensive tables of probabilities, expanding their
tables to include four-response tests as well. Space will not
permit the reproduction of their numerous tables, although
it is only fair to point out that Kohs and Richards are
somewhat more skeptical of the adequacy of the chance
correction formula than is Asker.

Holzinger (op. cit.) gives a simple algebraic proof that the
rights and rights minus wrongs correlate to unity (1.00)
when there are no items omitted. This is obviously true,
although Holzinger seems to be the first to give a
mathematical proof. The practical significance of this proof is
not altogether clear, as evidence to be presented later shows
that guessing at all items may not be a defensible practice.
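
The algebra behind this result is short. The lines below are a sketch of one way to see it and are not necessarily Holzinger's own presentation; N, the number of items in the test, is a symbol introduced here for illustration.

```latex
% Assumption: every one of the N items is answered, so none is omitted.
\begin{align*}
W &= N - R, \\
R - W &= R - (N - R) = 2R - N .
\end{align*}
% R - W is thus a linear function of R with positive slope 2, so over any
% group of pupils whose scores are not all alike, the correlation between
% R and R - W is exactly +1.
```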

The foregoing brief abstracts and comments are hardly
adequate to the mathematics of the situation, but the author
is swayed to omit further reference to the a priori and
mathematical approaches to the problem in favor of actual
experimental results obtained under normal classroom
conditions. That mathematical and experimental analyses of the
chance situation may not agree entirely is to be expected in
the light of such considerations as:

(1) Many so-called “guessed” responses represent answers
made upon a basis of marginal information or “fringes of
knowledge,” not very definite or certain, but nevertheless
sufficient to cause an excess of right over wrong responses,
and hence do not follow the mathematical expectancies to
be derived from the binomial theorem or other basic laws
of probability.

(2) Wrong answers, definitely matters of misinformation, 
occur, and these cannot reasonably be held to follow laws 
of chance. 

(3) There is sufficient evidence (presented later) to prove 
that much pure guessing can be eliminated by suitable 
instructions. This fact tends to lessen the errors presumably 
unavoidable from purely mathematical reasoning. 
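
Consideration (1) can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every "guess" on a two-response item has the same fixed probability of being right; the function name and the value 0.6 are hypothetical and are not taken from any of the studies cited.

```python
def expected_r_minus_w(n_guessed, p_right):
    """Expected R-W on n_guessed two-response items, each answered with
    probability p_right of success (an assumed, simplified model)."""
    expected_rights = n_guessed * p_right
    expected_wrongs = n_guessed * (1 - p_right)
    return expected_rights - expected_wrongs

pure_chance = expected_r_minus_w(20, 0.5)   # blind guessing
fringe = expected_r_minus_w(20, 0.6)        # slight "fringe of knowledge"
print(pure_chance, fringe)
```

Under this model blind guessing leaves nothing after correction, whereas guesses backed by even a slight fringe of knowledge leave a positive remainder, so such responses cannot be expected to follow the binomial expectancies for pure chance.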

The binomial theorem applied to chance responses in 
tests. The serious student of examinations will naturally 
wish to know the mathematical basis by which Asker arrived 
at the results given in Tables 53 and 54. 

If we may assume that the answering of true-false and
other two-response tests reduces to a pure chance situation
when the pupil is absolutely ignorant of the correct answers
to certain questions, the theoretical expectancy of R−W
scores of any given magnitude may be derived from the
expansion of the binomial, $(t+f)^n$, where t represents true
responses and f stands for false responses.¹

Any text on college algebra or many elementary
treatments of statistical methods will explain the application of
the binomial theorem to purely chance situations.²

Using t for true responses, f for false responses, and n for
the number of items guessed at, the formula for the binomial
expansion is given below.

The symbol, !, means the factorial of the number, i.e.,
the product of all numbers from 1 to that number. Thus,
factorial 6 (written 6!) is 1×2×3×4×5×6, or 720.

$$(t+f)^n = t^n + n\,t^{n-1}f + \frac{n(n-1)}{2!}\,t^{n-2}f^2 + \dots + f^n$$

The rth term is:

$$\frac{n(n-1)\cdots(n-r+2)}{(r-1)!}\,t^{\,n-r+1}f^{\,r-1} \qquad \text{(Formula 2)}$$

which may also be written:

$$\frac{n!}{(r-1)!\,(n-r+1)!}\,t^{\,n-r+1}f^{\,r-1} \qquad \text{(Formula 2a)}$$

It is actually simpler to find the coefficient of any term (the
rth term) by multiplying the coefficient of the preceding
term by the exponent of t in that term, and dividing this
product by a number one larger than the exponent of f in
that term. This procedure is simpler only if each term is
being found in order. If a given term is to be found in
isolation (the preceding terms not being found) Formula
2a above is more convenient.

¹Later discussions will show that there is some evidence to the effect that true-false
tests behave somewhat differently from two-response recognition tests. In particular,
true-false tests seem to be markedly more difficult than two-response tests based upon the
same subject-matter, as nearly as the two types of tests can be made to cover the same
ground. The reason for the greater difficulty of the true-false test may lie in the fact that
there is a certain negative suggestion effect in such tests. Moreover, it will be demonstrated
later that there are some reasons to think that a true-false test is not a typical two-response
test. (See pages 343 to 345.)

²H. L. Rietz and A. R. Crathorne, College Algebra (New York: Henry Holt and
Company, 1919), pp. 93-95. Or,
L. L. Thurstone, The Fundamentals of Statistics (New York: The Macmillan Company,
1923), pp. 132-141.

The first three terms of $(t+f)^{20}$ may be found by Formula
2a as follows:

The first term (r = 1) is:

$$\frac{20!}{(1-1)!\,(20-1+1)!}\,t^{20-1+1}f^{1-1} = \frac{20!}{0!\,20!}\,t^{20}f^{0} = t^{20}$$

The second term (r = 2) is:

$$\frac{20!}{(2-1)!\,(20-2+1)!}\,t^{20-2+1}f^{2-1} = \frac{20!}{1!\,19!}\,t^{19}f = 20\,t^{19}f$$

The third term (r = 3) is:

$$\frac{20!}{(3-1)!\,(20-3+1)!}\,t^{20-3+1}f^{3-1} = \frac{20!}{2!\,18!}\,t^{18}f^{2} = 190\,t^{18}f^{2}$$

If we continue in this manner until the last or
(n+1)th term is reached, we obtain the following table
(Table 55) of coefficients of t and the probabilities shown.
It is to be noted that there are n+1 terms in any expansion
of the binomial to the exponent n.

TABLE 55

Table Showing the Coefficients of Each Term and the Probabilities
of Occurrence of T (True) Responses According to the Expansion of
the Binomial $(t+f)^{20}$

 (a)           (b)               (c)
Term       Coefficients    Probability: 1 in:
  1                1          1,048,576.0
  2               20             52,428.8
  3              190              5,518.8
  4            1,140                919.8
  5            4,845                216.4
  6           15,504                 67.6
  7           38,760                 27.0
  8           77,520                 13.5
  9          125,970                  8.3
 10          167,960                  6.2
 11          184,756                  5.7
 12          167,960                  6.2
 13          125,970                  8.3
 14           77,520                 13.5
 15           38,760                 27.0
 16           15,504                 67.6
 17            4,845                216.4
 18            1,140                919.8
 19              190              5,518.8
 20               20             52,428.8
 21                1          1,048,576.0
Sum        1,048,576



Table 55 shows the sum of the 21 coefficients to be 1,048,576.
Column (c) shows this sum divided by each coefficient
in turn in order to express the probability of occurrence.

In order to apply this table to the actual situation of a
true-false test, it will help to examine the first and last terms
in the expansion of $(t+f)^{20}$. When expanded the expression
reads as follows:

$$t^{20} + 20t^{19}f + 190t^{18}f^{2} + 1140t^{17}f^{3} + 4845t^{16}f^{4} + \dots + 190t^{2}f^{18} + 20tf^{19} + f^{20}.$$

The first term ($t^{20}$) represents the situation: 20 true and 0
false. The second term ($20t^{19}f$) represents the situation:
19 true and 1 false. The third term ($190t^{18}f^{2}$) represents the
situation: 18 true and 2 false. The exponents of t in each
term represent the numbers of true responses and the
exponents of f in each term show the number of false
responses. The ratio of the coefficient of any term to the sum
of the coefficients of all (n+1) terms is the probability of
occurrence of the situation represented by that term.
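
The term-by-term rule for the coefficients lends itself to a few lines of code. The sketch below is one possible rendering of that rule; the function name is hypothetical, and exact integer division is used because each step is known to come out whole.

```python
def binomial_coefficients(n):
    """Coefficients of (t + f)**n, built term by term.

    Each coefficient is the preceding one times the exponent of t in the
    preceding term, divided by one more than the exponent of f in that term.
    """
    coefficients = [1]                  # first term: t**n, f**0
    for f_exp in range(n):              # exponent of f in the preceding term
        t_exp = n - f_exp               # exponent of t in the preceding term
        coefficients.append(coefficients[-1] * t_exp // (f_exp + 1))
    return coefficients

coefficients = binomial_coefficients(20)
total = sum(coefficients)               # equals 2**20 when n = 20
for term, c in enumerate(coefficients, start=1):
    print(term, c, round(total / c, 1))
```

The three printed columns correspond to columns (a), (b), and (c) of Table 55.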

EXPERIMENTAL INVESTIGATIONS 

General considerations. It has been pointed out that
there are two general approaches to the problem of chance
effects in test scores, viz., (a) from the standpoint of the
mathematical theory of probability, and (b) by actual
experimentation. Typical studies of the first type have been
considered in the preceding section of this chapter.

Some of the difficulties in the way of applying probability
theorems to the chance situation have been mentioned.
In the first place, much so-called guessing is not pure
chance. The pupil, on the contrary, responds with subliminal
or marginal knowledge sufficient to cause him to “guess”
right far oftener than wrong. Again, there is the possibility
that true-false tests are not situations affording a 50:50
chance for success since there may be positive or negative
suggestion effects in such tests. This evidence will be
presented in Chapter XIII. In the third place, as we shall
soon see, there is some reason to hold that the chance effects
are slightly different in true-false and two-response tests
proper. Lastly, there is some evidence that guessing should
be discouraged in framing test instructions, and that
directions against guessing really help to control the chance
factor.

It has been assumed that guessing in recognition tests
may be corrected, at least sufficiently well for practical
purposes, by the formula:

$$\text{Score} = \text{No. Right} - \frac{\text{No. Wrong}}{n-1}$$

where n is the number of responses presented, and from
which one correct answer is to be selected.

For true-false and two-response tests, this formula is:

Score = No. right minus the number wrong, or $S = R - W$

For three-response tests, the formula is:

Score = No. right minus ½ the number wrong, or $S = R - \frac{W}{2}$
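
As a minimal sketch of the general formula (the function name and the use of exact fractions are illustrative choices, not part of the text):

```python
from fractions import Fraction

def chance_corrected(rights, wrongs, n_responses):
    """Return R - W/(n - 1), the score corrected for chance.

    n_responses is the number of alternatives offered per item
    (2 for true-false or two-response, 3 for three-response, and so on).
    Only rights and wrongs enter the computation, so omitted items
    neither add to nor subtract from the score.
    """
    if n_responses < 2:
        raise ValueError("a recognition item needs at least two responses")
    return Fraction(rights) - Fraction(wrongs, n_responses - 1)

print(chance_corrected(15, 5, 2))   # two-response case: 15 - 5
print(chance_corrected(15, 5, 3))   # three-response case: 15 - 5/2
```

Exact fractions keep the half-point scores of the three-response case from being blurred by rounding.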


Although this general formula appears to be logical for
purely chance situations, it is possible to study the actual
value of the formula experimentally. There is also the
pertinent question of ascertaining whether pupils should be
told to guess or not to guess when in real doubt. This also
is open to experimental study.


Four questions will serve to state the general issues:

1. Does the $R - \frac{W}{n-1}$ formula increase the reliability of
test scores when compared with scoring simply the number
right?

2. Does the $R - \frac{W}{n-1}$ formula increase the validity of test
scores when compared with scoring simply the number right?

3. Should pupils be instructed in favor of or against
guessing when the answer is entirely unknown (after careful
thought)?

4. Does the $R - \frac{W}{n-1}$ formula over-, under-, or
properly-correct for pure chance successes?

These four questions will be discussed in the following
pages. The order of treatment will be historical rather
than topical.

The investigation of the scoring of Army Alpha. The
first experimental study of this question seems to be some
rough preliminary work done during the preparation of
Army Alpha for use in the World War.¹

The ten tests tried out for the final forms of Army Alpha 
(of which eight were finally retained) were scored in two 
ways: (1) number right and (2) number right minus 
number wrong, for seventy soldiers at Camp Lee. Table 56 
shows the results. 


TABLE 56

Correlation of Rights and Rights Minus Wrongs for Ten Preliminary
Tests of Army Alpha on 70 Cases. All Correlations Are Against
Total Scores

Test No.   Type of Test                                Rights    R−W
   1       Oral directions (little chance)               .76     .61
   2       Memory for digits (discarded)                 .62     .36
   3       Mixed-up sentences (true-false)               .66     .72
   4       Arithmetic problems (little chance)           .84     .70
   5       General information (4-response)              .90     .81
   6       Synonym-antonym (same-opposite)               .76     .82
   7       Practical judgment (3-response)               .81     .79
   8       Number series completion (little chance)      .74     .69
   9       Analogies (4-response)                        .79     .70
  10       Number comparison (later discarded)           .70     .64


It should be noted that the R—W formula applies to 
two-response tests. The only two-response tests among 
the ten of Table 56 are numbers 3 and 6. In both of these 
cases the R — W method gave .06 higher correlation coeffi- 
cients. In all other cases (where the R—W formula obvious- 
ly does not apply), the correlation was lowered by the use 
of corrected scores. 
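
The conventional correction may be sketched in a few lines of Python for the reader who wishes to check the arithmetic. The function name `corrected_score` and the sample figures are illustrative assumptions, not taken from the studies cited; omitted items are assumed to count neither as right nor as wrong.

```python
def corrected_score(rights, wrongs, n_choices):
    """Return R - W/(n - 1), the conventional correction for chance.

    Assumption: omitted items are left out of both counts.
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two alternatives")
    return rights - wrongs / (n_choices - 1)


# Hypothetical pupil on a 200-item two-response test: 140 right, 60 wrong.
two_response = corrected_score(140, 60, 2)    # reduces to R - W
# Hypothetical pupil guessing blindly on 200 four-response items.
blind_guesser = corrected_score(50, 150, 4)   # R - W/3
```

With two alternatives the divisor is 1 and the formula reduces to R − W. For the blind guesser, the 50 rights expected by chance on 200 four-response items are cancelled by the 150 wrongs.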


¹R. M. Yerkes (Editor), Psychological Examining in the U. S. Army, Memoirs of
the National Academy of Sciences, Vol. 15 (1921), pp. 305 and 339.
The effect of corrections for chance on reliability. The
next study of the problem seems to be that of Ruch,¹ who
studied the effect of corrections for chance by the formula
$R - \frac{W}{n-1}$ on seven tests of the Terman Group Test of
Mental Ability. Table 57 shows the results. The reliabilities
were almost uniformly, but slightly, lower when the scores
were corrected for chance. This study was based upon an
entirely inadequate number of cases (43).


TABLE 57

Reliabilities of Separate Tests of the Terman Group Test of
Mental Ability When Corrected and Uncorrected for Chance

Test                    Uncorrected    Corrected      Type
1. Information          .49 ± .078     .45 ± .082     4-response
2. Best answer          .40 ± .086     .38 ± .088     3-response
3. Word meaning         .67 ± .056     .56 ± .071     2-response
6. Sentence meaning     .53 ± .073     .47 ± .080     Yes-No
7. Analogies            .53 ± .074     .55 ± .072     4-response
8. Mixed sentences      .68 ± .056     .56 ± .071     True-False
9. Classification       .35 ± .091     .41 ± .086     5-response

Average                 .52            .48





Ruch and Stoddard² next reported similar results for
several types of objective tests over the same subject-
matter. Table 58 summarizes the results.

Again, there is little evidence of increased reliability by 
the use of corrections for chance. (The numbers of cases 
varied from 133 to 137.) The three-response test was 
helped by the correction. 

Paterson and Langlie³ next published results very much
in harmony with the foregoing. For 111 students (classes
of two successive years) taking a true-false test in general
psychology, the reliability for “rights” was 0.63 ± .037 and
for R − W was 0.54 ± .045.

¹G. M. Ruch, The Improvement of the Written Examination (Chicago: Scott, Foresman
and Co., 1924), p. 119.

²Ibid., pp. 117-118. The full account of this study appears in: G. M. Ruch and
G. D. Stoddard, “Comparative Reliabilities of Five Types of Objective Examinations,”
Journal of Educational Psychology, Vol. XVI (1925), pp. 89-103.

³D. G. Paterson and T. A. Langlie, “Empirical Data on the Scoring of True-False
Tests,” Journal of Applied Psychology, Vol. IX (1925), pp. 339-348.

TABLE 58

Reliability Coefficients Corrected and Uncorrected for
Chance Elements

Type            Uncorrected     Corrected, R − W/(n−1)
5-response      .80 ± .021      .77 ± .023
3-response      .60 ± .037      .67 ± .031
2-response      .74 ± .027      .68 ± .031
True-false      .56 ± .040      .41 ± .049



The evidence surveyed thus far is rather unfavorable 
to the use of the corrections formula, but it should be pointed 
out that relatively small populations were used in all these 
studies. For this, and other reasons, it will be unwise to 
draw conclusions at this time. 

The effect of corrections for chance on validity. During
the year 1924-1925 Wood and Ruch, working independently
under grants from the New York Commonwealth Fund,
attacked the matter of correction for chance from the stand-
point of the relative validities of R and R − W scorings.
These results were not published until 1926.¹

Certain facts should be noted about the conduct of these
investigations. (Chapter XI gave a short description of the
procedures.) Wood used college examinations in law and
anatomy courses. His criteria of validity differed somewhat
from one course to another but in general included instruc-
tors’ judgments, essay-examination marks, and other
objective tests. The tests used were true-false examinations

¹B. D. Wood, “Studies of Achievement Tests,” Journal of Educational Psychology, Vol.
XVII (1926), pp. 1-22, 125-139, and 263-269.

G. M. Ruch et al., Objective Examination Methods in the Social Studies (Chicago: Scott,
Foresman and Co., 1926).

G. M. Ruch and M. H. DeGraff, “Corrections for Chance and ‘Guess’ vs. ‘Do Not
Guess’ Instructions in Multiple-Response Tests,” Journal of Educational Psychology, Vol.
XVII (1926), pp. 368-375.
of from 100 to 200 items. Wood’s students were told to
omit items when the answering would be a pure guess. 
DeGraff and Ruch attacked the same problem on the 
elementary- and high-school level, using a criterion of scores 
on a simple and highly objective recall test. Against this 
criterion several multiple-response and true-false tests were 
tried out. DeGraff and Ruch went one step further than 
Wood in that the former administered their tests half with 
instructions to guess, and half with instructions not to guess 
but to omit when in serious doubt. 

Since the section on comparative validities (Chapter XI) 
summarized both studies, the results are not repeated here. 
Wood found an average validity coefficient of 0.721 for R
scores and 0.769 for R − W scores, a difference of about
0.05. Wood used populations of 100 except in one case
where N was 74. Correction also increased the validity of
three law examinations by an average of 0.055, the criterion
being six essay examinations.

Wood’s data on the reliabilities of R and R − W scorings
for these same law examinations and also one in French
were given in Table 35 of Chapter XI. In general, the former
gave slightly more reliable results, and Wood says, “In no
case does the R score suffer by comparison with the R − W
score, and in only one case does R − W compare at all favor-
ably as to reliability.”¹

As before, correcting for chance lowered the reliability 
slightly, but, more important, the validity was materially 
increased. This study and that of DeGraff and Ruch place 
a different light on the whole problem of correction. 

The results of DeGraff and Ruch were given in some
detail in Tables 27 and 36 of Chapter XI. In order to
bring both reliability and validity coefficients together in
one place, and also in order to see the effects of instructions
about guessing, Table 59 is given.


¹Loc. cit., pp. 8-9.



TABLE 59 

Intercorrelations, Corrected and Uncorrected for Chance, for All Ten Tests Used 



(g) indicates the tests taken under instructions to guess.

(n) indicates the tests taken under instructions not to guess. 
It is to be noted that all recognition (multiple-response and
true-false) tests were corrected for chance by the formula
$R - \frac{W}{n-1}$, although five- and seven-response tests are
never corrected in actual practice. Our chief interest there-
fore attaches to those tests having two or three responses.
Rows (13) and (14) of Table 59 give the average correla-
tions, both validities and reliabilities, for all of the two-
and three-response tests and for all of the two-response
tests, respectively.

The results are quite in harmony with Wood’s results in 
the sense that validities are increased by the use of correc- 
tions for chance. They disagree with Wood, and all pre- 
ceding investigations, in that correction for chance increased 
reliability. This is somewhat surprising, but much weight 
must be given to the fact that this is the most extensive
investigation yet published, each correlation being based
upon more than 200 cases and almost 2500 cases being used
to find the reliability of the recall tests (Rows (11) and (12)).

E. P. Wood was the next to publish results bearing on the
question of correcting scores for chance.¹ Mrs. Wood used
a lengthy criterion built up from seven separate measures.
The reliability and validity of this criterion are hardly open
to question. She gave to 147 college students in a course in
government four tests as follows:

(a) A true-false “do not guess” test of 213 items.

(b) A 5-response “do not guess” test of 159 items.

(c) A simple recall “do not guess” test of 227 items.

(d) An old-type test (given two days prior to the three
objective tests).

These tests did not cover “precisely” the same information;
all were intended to be the best possible one-hour test on
government. The items were largely problematical rather
than informational. Her principal results follow:


¹E. P. Wood, “Improving the Validity of Collegiate Achievement Tests,” Journal of
Educational Psychology, Vol. XVIII (1927), pp. 18-25.



CHANCE AND GUESSING IN TESTS 




                   Correlations With the Criterion
Test               Rights      R − W/(n−1)        No. of Items
True-false          .748         .845                213
Five-response       .850         .860                159
Simple-recall       .880         ....                227


As Mrs. Wood concludes, her results are quite in harmony 
with those of Ben D. Wood and of Ruch and DeGraff. 

Effects of instructions concerning guessing. Table 60
presents the findings of DeGraff and Ruch¹ on the value of
attempting to control the chance factors in recognition tests
by the use of instructions for and against guessing.

TABLE 60

Effects of “Guess” and “Do Not Guess” Instructions on Test
Scores (After DeGraff and Ruch)

                      Average Score on 200 Items
Type of Test          (a) Rights    (b) R − W/(n−1)    Differences (a − b)
Recall                   55.4           ....
7-Response (g)           89.6           74.1               15.5
7-Response (n)           82.1           73.1                9.0
  Differences             7.5            1.0
5-Response (g)           99.7           78.8               20.9
5-Response (n)           90.9           78.7               12.2
  Differences             8.8            0.1
3-Response (g)          118.8           80.1               38.7
3-Response (n)          102.3           78.0               24.3
  Differences            16.5            2.1
2-Response (g)          138.9           80.7               58.2
2-Response (n)          125.4           85.9               39.5
  Differences            13.5           −5.2
True-false (g)          127.2           59.3               67.9
True-false (n)          108.6           57.6               51.0
  Differences            18.6            1.7



¹Objective Examination Methods in the Social Studies, p. 87.
Comparison of the differences of Column (a) of Table 60
shows clearly that perhaps fifty per cent of the guessing 
in multiple-choice tests can be eliminated by instructing 
pupils against guessing. This finding of itself proves nothing 
as to the merits of guessing or omitting unknown items. 
There is a sharp division of opinion on this point. McCall 
favors “guess” instructions. Wood insists upon “do not 
guess.” Most authors say nothing about guessing in their 
test instructions (judging by standard tests). The author 
has always tended to agree with Wood, particularly in the 
light of the work of DeGraff and Mrs. Wood. McCall and 
others incline to think that the more items guessed at, 
the more adequate the correction formula will be in eliminat- 
ing such effects. The real test is the effect of such instruc- 
tions on the validity and reliability of the scores. Table 61 
gives us just such an analysis of the results of the experi- 
ments of DeGraff and Ruch. 

Table 61 shows fairly clearly that instructions against 
guessing increase the reliability somewhat, especially in 
comparisons with uncorrected scores (rights) when pupils 
are told to guess at all times. Table 62 presents a similar 
analysis of validity coefficients for the same investigation. 

Table 62 indicates again the slight superiority of “do not
guess” instructions, especially when correction of scores is 
practiced in connection with warnings against guessing. 

It must be admitted that the evidence in favor of non- 
guessing at items (the answering of which appears to be 
sheer chance) is not very conclusive. We must leave this 
issue with the tentative recommendation against wide- 
spread guessing. The evidence for this conclusion, as we 
have seen, rests on three facts: 

(1) Instructions against guessing lowered average scores 
(rights) about ten to fifteen per cent. This means that 
pupils tended to omit rather than to guess, although there 
is no means of knowing how much guessing was still present. 
(2) Validities were slightly higher for “do not guess”
directions, especially when scores were corrected.

(3) Reliabilities were somewhat higher for “do not
guess” instructions, particularly when compared with un-
corrected scores under instructions to guess.
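
The first of these facts can be checked directly from the rights column of Table 60. The short Python sketch below recomputes the percentage by which “do not guess” instructions lowered the average number right. The variable and function names are illustrative assumptions; the figures are those printed in Table 60.

```python
# Average rights on 200 items from Table 60: (guess, do not guess).
average_rights = {
    "7-response": (89.6, 82.1),
    "5-response": (99.7, 90.9),
    "3-response": (118.8, 102.3),
    "2-response": (138.9, 125.4),
    "true-false": (127.2, 108.6),
}


def pct_drop(guess, no_guess):
    """Percentage by which the average rights fall under 'do not guess'."""
    return 100 * (guess - no_guess) / guess


drops = {test: round(pct_drop(g, n), 1) for test, (g, n) in average_rights.items()}

for test, drop in drops.items():
    print(f"{test:12s} {drop:5.1f} per cent")
```

The drops run from roughly eight to fifteen per cent, the largest falling on the three-response and true-false tests.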

Is the true-false test a two-response test? Tables 48, 
50, 51, and 60 have raised definitely a question whether true- 
false and two-response multiple-choice tests are equally 
difficult. When the “same” items are administered in each 
of these two types, the true-false appears to be significantly 
the more difficult (“same” being defined in the sense shown 
by the items used below as an illustration). Granting, for 
the sake of discussion, that this is a true finding, what is the 
probable explanation? 

One theory that has been frequently advanced, although
not in explanation of the difference under discussion, is that
false statements have a negative suggestion effect. If this
is true, part or all of the difference may be accounted for.
We shall see in the next chapter that the evidence for such a
negative suggestion effect is not wholly conclusive; in fact,
the evidence points to the fact that such an effect is of small
consequence.

A second hypothesis advanced by the author¹ may be
somewhat more promising. As a basis for discussion, the 
following sample items are quoted from the investigation 
by DeGraff and Ruch: 


Two-Response

1. Christopher Columbus discovered America in the year (1) 1498
(2) 1492

2. The first Pilgrims were brought to American shores in the ship named
the (1) Mayflower (2) Half Moon

3. Eli Whitney is noted for his invention of the (1) spinning jenny
(2) cotton gin

4. Robert Fulton's contribution to civilization was the development of
the (1) Atlantic cable (2) steamboat

5. The first permanent English settlement in America was (1) Ply-
mouth (2) Jamestown

¹Journal of Educational Psychology, Vol. XVII (1926), pp. 374-375.


True-False

1. Christopher Columbus discovered America in the year 1492.

2. The first Pilgrims were brought to America in the ship named the
Mayflower.

3. Eli Whitney is noted for his invention of the spinning jenny.

4. Robert Fulton's contribution to civilization was the development of
the steamboat.

5. The first permanent English settlement in America was Plymouth.

Certain facts are obvious, a priori: 

1. If a pupil is absolutely ignorant of an item, the chances 
of success are 50:50 on both true-false and two-response 
tests. 

2. If he has a “fringe” of knowledge in either case, he 
will succeed more often than he fails. 

3. If the chances of success are 50:50, the chance correc-
tion formula (R − W) is equally valid in both types of tests.

4. If suggestibility enters into either type of test, it is
equally divided between the two responses of the true-false
test,¹ or at least approximately so.

5. If suggestibility, in the sense of a tendency to accept
any statement suggested as true as being true, or vice versa,
enters into the answering of true-false tests, there will be
no net effects of an excess of statements marked either “true”
or “false,” if the test contains an equal number of true and
false statements.

¹Mathews, Journal of Educational Psychology, Vol. XVIII (1927), pp. 445-457, on the
contrary, has brought forward evidence that in two-response tests the first stated alternative
is selected 3.2% oftener than is the response which comes second, and also that the upper
response is selected 33.8% oftener than the lower when the two are printed vertically.
This issue will be discussed later. A student of the author has just completed a repetition
of Mathews's study.
Returning now to the second hypothesis, let us divide
any statement of an item into two parts: (1) the critical
statement, and (2) the completion.

Thus, for item 2 the critical statement is: “The first Pil-
grims were brought to American shores in the ship named
the . . . .”

The completions are: (true-false) Mayflower; (two-re-
sponse) (1) Mayflower (2) Half Moon. Note that the true-
false type does not mention at all the Half Moon.

If a pupil is ignorant of the name of the ship which brought
the first Pilgrims to America, it does not necessarily imply
that he is equally ignorant of the Half Moon. He may
have, in fact, associations which couple the Half Moon
with Henry Hudson. If so, he has a basis, indirect to be
sure, of arriving at the correct answer by elimination. The
chances are obviously not 50:50 here, although they might
be if he met the statement in true-false form.

The point of the foregoing discussion is that both true-
false and two-response tests present 50:50 chances for success
when absolute ignorance is the case, but that the two-
response is more open to successful answering than is the
true-false when knowledge is present about either one of
the two response words. This fact alone might, in the long
run, make two-response tests somewhat less difficult.
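
The argument may be put in numbers. Suppose, purely for illustration, that a pupil who does not know the correct completion nevertheless recognizes the wrong alternative as wrong on a fraction q of such items. Here q is a hypothetical parameter, not a quantity estimated in any of the studies cited, and the true-false statement is assumed to name only the completion about which he knows nothing. The Python sketch below compares his chance of success on the two forms under those assumptions.

```python
def p_true_false():
    """Chance of success on a true-false item the pupil cannot judge directly."""
    return 0.5


def p_two_response(q):
    """Chance of success on the matching two-response item.

    q: assumed probability that the pupil can rule out the wrong
       alternative (e.g. he links the Half Moon with Henry Hudson).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie between 0 and 1")
    # With probability q he answers by elimination; otherwise he guesses.
    return q * 1.0 + (1.0 - q) * 0.5


for q in (0.0, 0.2, 0.4):
    print(q, p_true_false(), p_two_response(q))
```

At q = 0 the two forms are equally difficult; any positive q tilts the odds toward the two-response form.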

The next section shows a method by which this problem 
may be attacked statistically. 

The determination of the best method of scoring multiple-
response tests through partial and multiple correlations.¹
Thurstone² seems to have suggested the selection of the best
scoring method through the technique of multiple correla-
tion. Starting with the general formula S = R + CW, in
which S is the score, R the number of rights, W the number

¹The student not familiar with partial and multiple correlation methods will probably
not follow the full argument of this section.

²L. L. Thurstone, “A Scoring Method for Mental Tests,” Psychological Bulletin, Vol.
XVI (1919), pp. 235-240.
of wrongs, and C is a numerical constant which would de-
termine the amount of deduction for errors, he derives the
following formula for finding C:

$$C = \frac{\sigma_R\,(r_{IR}\,r_{RW} - r_{IW})}{\sigma_W\,(r_{IW}\,r_{RW} - r_{IR})}$$

in which R, W, and C have the meaning stated above, and
I means the criterion variable.

It should be noted at the outset that in case C takes on
the value −1.00, this general formula becomes S = R − W,
or the usual one. Moreover, it is just as convenient to
employ the more familiar form of solution of a problem in
multiple correlation, and especially so since Thurstone did
not expand his derivations to include the fourth possible
variable, viz., the number of omissions. Evidence to be
presented here shows that a general formula of the type
$S = C_1R + C_2W + C_3O$ is needed when instructions to omit
items rather than to guess are used. (Thurstone apparently
had in mind that all items would be attempted. At the time
Thurstone wrote it was almost universal practice to instruct
students to guess when in doubt.)
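
The relation between Thurstone's constant and the ordinary regression solution can be sketched in Python. The six sets of scores below are invented solely for illustration, and the function names are assumptions. The sketch computes C from the three correlations and the two standard deviations, a quantity algebraically equal to the ratio of the regression weight for wrongs to that for rights. It then compares the validity of S = R + CW with that of the usual R − W.

```python
from statistics import mean, pstdev


def corr(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))


def thurstone_c(criterion, rights, wrongs):
    """Weight C for wrongs in S = R + C*W, from correlations and sigmas."""
    r_ir = corr(criterion, rights)
    r_iw = corr(criterion, wrongs)
    r_rw = corr(rights, wrongs)
    s_r, s_w = pstdev(rights), pstdev(wrongs)
    return (s_r * (r_ir * r_rw - r_iw)) / (s_w * (r_iw * r_rw - r_ir))


# Hypothetical scores for six pupils (illustration only, not data from the text).
criterion = [25, 25, 45, 40, 56, 60]
rights = [20, 24, 27, 30, 31, 36]
wrongs = [10, 14, 6, 12, 4, 8]

c = thurstone_c(criterion, rights, wrongs)
best = [r + c * w for r, w in zip(rights, wrongs)]      # S = R + CW
r_minus_w = [r - w for r, w in zip(rights, wrongs)]     # a priori scoring

print(round(c, 2), round(corr(best, criterion), 3), round(corr(r_minus_w, criterion), 3))
```

For these invented scores C comes out a little short of −1.00, and the empirically weighted score correlates with the criterion at least as highly as R − W does. This is the sense in which the regression solution can only equal or improve upon the a priori formula in the sample at hand.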

If evidence continues to accumulate to demonstrate that
the value of the constant C in Thurstone’s formula is not
−1.00, it will be possible eventually to replace the a
priori formula $S = R - \frac{W}{n-1}$ by empirical multiple-regress-
ion equations.

Several investigators, including Thurstone, have found
that the values of C are not exactly the a priori ones (which
are the ones commonly used when the R − W, R − ½W,
R − ⅓W, etc., are used for two-, three-, and four-response
tests, respectively).¹ Brinkley,² following the procedure of
Thurstone, tried out a number of methods of scoring tests.


¹T. L. Kelley, about 1921, mentioned to the author in the course of a private con-
versation that certain studies of his own had shown that was more defensible
than R − W for scoring true-false tests.

²“New-Type Examinations in the High School,” pp. 94-97.
Before starting with a discussion of the results of Brinkley
and others, it will be well to define the subscripts used
throughout this section: (1) represents the criterion or
dependent variable. (2) represents the number right. (3)
represents the number wrong. (4) represents the number
of omissions.

Table 63 gives an abstract of the experimental findings 
presented by Brinkley. 


TABLE 63

Summary of Brinkley’s Study of the Proper Scoring Formula for
Multiple-Choice Tests

Correlations                    True-False   3-Response   4-Response   5-Response
                                (31 Items)   (30 Items)   (30 Items)   (30 Items)
$r_{12}$                           .784         .677         .667         .763
$r_{13}$                          −.742        −.660        −.570        −.615
$r_{23}$                          −.753        −.926        −.894        −.875
Best weighting for Wrongs
  (Thurstone's C)                 −.81         −.53         +.18         +.22
Correlation with criterion
  when scored R − W/(n−1)          .82          .68          .67          .76


One other point should be noted about Brinkley’s results,
viz., that scoring by the formula $S = R - \frac{W}{n-1}$ gives better
results than “number right” only in case of the true-false,
and here the difference was but 0.04. Brinkley apparently
did not attempt to find out whether the formula S = R − .81W
(using the value he found for C) would have been even better
than R − W for the true-false test. Brinkley’s results suggest
that the R − W formula over-corrects true-false tests, i.e.,
penalizes unduly. A better formula would be R − .8W, or
rights minus four-fifths the wrongs. We shall see later how
well this agrees with later studies.
Two other studies may be mentioned, those of Mrs. E. P.
Wood and of R. R. Foster and G. M. Ruch, curiously enough
published simultaneously.¹

Mrs. Wood used an elaborate criterion composed of seven
separate measures pooled. The subject was a course in
government, and 147 cases were used in finding the corre-
lations. It is to be noted that Mrs. Wood used tests under
instructions “not to guess,” and hence her results show the
influence of the fourth variable, omissions; in this respect
the study is analogous to that of Foster and Ruch. Table
64 shows her results.


TABLE 64

Summary of E. P. Wood’s Results on the Scoring of Tests
When Chance Is Involved

Correlations                  True-False     5-Response     Completion
                              (213 Items)    (159 Items)    (227 Items)
$r_{12}$                         .748           .850           .880
$r_{13}$                        −.264          −.247          −.085
$r_{14}$                        −.480          −.615          −.730
$r_{23}$                         .145          −.146          −.052
$r_{24}$                        −.864          −.792          −.840
$r_{34}$                        −.607          −.459          −.459
$R_{1.234}$                      .847           .861           .890
$r_{12}$                         .748           .850           .880
Gains from use of wrongs
  and omissions                  .099           .011           .010


The last line of entries in Table 64 was computed by the
author. It is to be noted that this use of wrongs and omis-
sions is of no help in five-response and completion tests, but
raises the validity about .10 in the case of the true-false
test. Unfortunately Mrs. Wood did not publish her final
regression equations. She did not compute the values for
C in Thurstone’s formula.

¹E. P. Wood, “Improving the Validity of Collegiate Achievement Tests,” Journal of
Educational Psychology, Vol. XVIII (1927), pp. 18-25.

R. R. Foster and G. M. Ruch, “On Corrections for Chance in Multiple-Response
Tests,” ibid., pp. 48-51.

The study of Foster and Ruch differed somewhat from 
that of Mrs. Wood in several respects. In the first place 
(using the data of DeGraff and Ruch), the criterion was that 
of simple-recall scores (reliability .97) on the “same” items
(200 in number) as used in all multiple-response tests em-
ployed. Second, the instructions were varied to include 
both “guess” and “do not guess” directions. In the third 
place, the numbers ran somewhat larger (221-281). Lastly, 
the subjects were elementary- and high-school pupils in 
history. Tables 65 to 69 give the results. 

TABLE 65

Correlation Coefficients of the Zero Order

Instructions to Guess

              True-False       2-Response       3-Response       5-Response
$r_{12}$       .818 ± .014      .901 ± .008      .875 ± .010      .905 ± .007
$r_{13}$      −.769 ± .017     −.894 ± .009     −.757 ± .018     −.701 ± .022
$r_{23}$      −.627 ± .026     −.903 ± .008     −.675 ± .023     −.582 ± .029

Instructions Not to Guess

              True-False       2-Response       3-Response       5-Response
$r_{12}$       .784 ± .016      .768 ± .016      .887 ± .009      .897 ± .007
$r_{13}$      −.228 ± .039     −.381 ± .034     −.405 ± .037     −.364 ± .035
$r_{14}$      −.510 ± .030     −.466 ± .031     −.549 ± .031     −.583 ± .026
$r_{23}$      −.261 ± .038     −.038 ± .040     −.107 ± .044     −.117 ± .040
$r_{24}$      −.897 ± .008     −.841 ± .011     −.816 ± .014     −.811 ± .013
$r_{34}$      −.651 ± .023     −.495 ± .030     −.476 ± .034     −.477 ± .031


TABLE 66

Coefficients of Multiple Correlation

                               True-False   2-Response   3-Response   5-Response
$R_{1.23}$ (guess)                .882         .920         .903         .930
$R_{1.234}$ (do not guess)        .903         .846         .940         .935
TABLE 67

Regression Equations for the Raw or Gross Scores

Instructions to Guess

1. True-false    $X_1 = .867X_2 - .772X_3 - 2.358$
2. 2-response    $X_1 = .762X_2 - .693X_3 - 11.553$
3. 3-response    $X_1 = .716X_2 - .384X_3 - 4.853$
4. 5-response    $X_1 = .666X_2 - .268X_3 + 7.439$

Instructions Not to Guess

1. True-false    $X_1 = .785X_2 - .806X_3 - .059X_4 + 14.709$
2. 2-response    $X_1 = .511X_2 - 1.114X_3 - .433X_4 + 61.032$
3. 3-response    $X_1 = .632X_2 - .505X_3 - .114X_4 + 19.916$
4. 5-response    $X_1 = .337X_2 - .720X_3 - .409X_4 + 85.362$


TABLE 68

Regression Equations Using Standard Measures¹

Instructions to Guess

1. True-false    $Z_1 = .553Z_2 - .422Z_3$
2. 2-response    $Z_1 = .508Z_2 - .435Z_3$
3. 3-response    $Z_1 = .668Z_2 - .305Z_3$
4. 5-response    $Z_1 = .751Z_2 - .263Z_3$

Instructions Not to Guess

1. True-false    $Z_1 = .841Z_2 - .501Z_3 - .082Z_4$
2. 2-response    $Z_1 = .421Z_2 - .557Z_3 - .401Z_4$
3. 3-response    $Z_1 = .738Z_2 - .390Z_3 - .149Z_4$
4. 5-response    $Z_1 = .397Z_2 - .573Z_3 - .541Z_4$

¹Z is defined as $\frac{X - M}{\sigma}$, or $\frac{x}{\sigma}$.

The results of Wood and of Foster and Ruch show that
there is some theoretical advantage in taking into account
both wrongs and omissions in scoring true-false, two-re-
sponse, and possibly three-response. Coupled with the
preceding evidence to the effect that it is somewhat better to
TABLE 69

Improvement of $R_{1.23}$ or $R_{1.234}$ over $r_{12}$ in the Results of
Foster and Ruch



use instructions against guessing, it may be tentatively con-
cluded that Thurstone’s proposals (or the equivalent multi-
ple-regression equation method) might well be employed in
studies requiring a high degree of accuracy. It can hardly
be defended that such refinements are needed in actual
classroom tests. It is clear that even the omissions have
some significance in arriving at the best scoring formula.
On the other hand, the teacher employing tests of 100 to
250 items may safely score her tests simply “number right”
or perhaps $R - \frac{W}{n-1}$ in the case of three responses or
fewer, especially when pupils are instructed against wild
guessing.
Thurstone’s formula for obtaining the value for C (the
best weighting of wrongs) cannot be applied to the data of
E. P. Wood or the results of Foster and Ruch under “do not
guess” instructions because it makes no allowance for the
fourth variable in the situation, viz., omissions. A formula
which would also allow for the inclusion of omissions may be
derived quite readily. In the absence of such a formula, the
four-variable multiple-regression equation may be applied,
particularly if standard measures (sigma values) are em-
ployed.¹

Table 70 shows the values for C found by Brinkley and
by Foster and Ruch, as well as those suggested by the
formula $S = R - \frac{W}{n-1}$.

TABLE 70

Values of C for the Results of Brinkley and Foster and Ruch

                                                 Type of Test
                                   True-false  2-Response  3-Response  5-Response
Values of C as found by Brinkley      −.81        ....        −.53        +.22
Values of C as found for Foster
  and Ruch data                       −.89        −.91        −.54        −.40
Values of C indicated by
  R − W/(n−1) formula                −1.00       −1.00        −.50        −.25


The foregoing table shows far from perfect agreements of 
calculated and expected values (as obtained by the conven- 
tional scoring formula). Only in the case of the three- 
response is there reasonably close agreement. This table 
shows the need for extended study of the issue in question. 

Such data as are at hand suggest the possibility that the
R − W formula over-corrects by at least ten per cent, pos-
sibly by as much as fifteen to twenty per cent in two-response
and true-false tests. In fact, there is some evidence of over-
correction (undue penalization) in all multiple-response
tests.
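
The percentages quoted for two-response and true-false tests follow from comparing the empirical values of C in Table 70 with the a priori value of −1.00. The Python sketch below performs that comparison. The function name, and the definition of over-correction as the relative excess of the a priori weight over the empirical one, are assumptions made for illustration.

```python
A_PRIORI_C = -1.00  # weight for wrongs implied by R - W on two-alternative tests

# Empirical values of C for two-alternative tests, as printed in Table 70.
empirical_c = {
    ("Brinkley", "true-false"): -0.81,
    ("Foster and Ruch", "true-false"): -0.89,
    ("Foster and Ruch", "2-response"): -0.91,
}


def over_correction_pct(empirical, a_priori=A_PRIORI_C):
    """Relative excess (per cent) of the a priori penalty over the empirical one."""
    return 100 * (abs(a_priori) - abs(empirical)) / abs(a_priori)


for (study, test), c in empirical_c.items():
    print(f"{study:16s} {test:11s} {over_correction_pct(c):5.1f} per cent")
```

On this reckoning the Foster and Ruch values give an over-correction of about ten per cent and Brinkley's value one of nearly twenty per cent.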

On the other hand, even an over-correction of fifteen to 
twenty per cent is not of very great practical significance in 

¹This implies assuming rectilinearity of regression, of course. It should be noted that
Foster and Ruch found many correlations (especially when “do not guess” directions were
employed) which were markedly curvilinear. Wood did not comment on the degree of
rectilinearity of her data.
informal classroom testing, although the available evidence
suggests that careful experimental studies may well afford
to employ either the technique of Thurstone or the more
familiar (but equivalent) procedure of Foster and Ruch.

In conclusion, it may not be out of place to suggest that 
studies along the direction suggested by Thurstone, Brinkley, 
Wood, and Foster and Ruch be multiplied until generaliza- 
tions may be made leading toward a more final solution of 
issues discussed here. 

Some proposals for modified true-false tests. There have 
been numerous proposals looking toward a betterment of 
the true-false test. These have been directed chiefly at the 
elimination of chance errors, although some have been con- 
cerned with the matter of avoiding negative suggestion 
effects through the use of the interrogative statement 
followed by “yes-no” or “right-wrong.” 

The “true-false-doubtful” and the “yes-no-didn’t-say”
illustrate another line of thinking. (See Chapter VIII for
examples.)

Recently Christensen¹ and Greene² have suggested
interesting variations of the true-false. The former pro-
poses using the “same” item in both true-false and multiple-
choice form in the same test, i.e., each item occurring twice.
To earn one unit of score, both items of such pairs must be
correct. He found the reliabilities to increase markedly
in such arrangements. Thus, for 100 true-false and multiple-
choice items in general science, Christensen obtained the
following reliabilities:


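True-false alone (R − W scores) ........................ .67
True-false alone (R scores) ............................ .76
Multiple-choice alone (R scores) ....................... .80
True-false and multiple-choice combinations
   (Christensen's plan) ................................ .90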


¹Journal of Educational Research, Vol. XIV (1926), pp. 370-374.
²Vol. XVII (1928), pp. 102-107.
Similar gains were shown for validity when the Van
Wagenen General Science Reading Scale A was used as a
criterion.

He shows further that to obtain a reliability of .90 (that
of the combined plan) would require tests of these lengths:

True-false (R − W) ............. 450 items
True-false (R) ................. 288 items
Multiple-response .............. 200 items

Moreover, the combination plan would save time. Against
this undeniable advantage in reliability are certain rather
serious disadvantages: (a) the scoring is somewhat compli-
cated, and (b) 200 items give no wider sampling than do
100 items of either true-false or multiple-choice. To double
the sampling might well prove to be more valid than to
double up on half that number of items. This point should
be investigated before accepting this suggestion.

Greene’s suggestion is lightly more workable since he 
has simplified the problem of scoring and tabulating. This 
writer would give each item both as a true and as a false 
statement. Items one to fifty present the statements, and 
items fifty-one to one hundred repeat the former, making 
the true statements false, and vice versa. As before, one 
point credit is given for each pair correct. Greene’s evidence 
does not appear very conclusive, as he points out. More-
over, Greene comments on the fact that 100 different items
may be superior to fifty paired items. His suggestion is 
made more in the effort to encourage study of methods of 
bettering true-false tests than as a practical suggestion for 
immediate adoption. 

Another proposal for the modification of the true-false 
test is that of McClusky and Curtis. The suggestion of 
these writers is that pupils mark all true statements with a 
“T” but “correct the statements you consider to be false
by substituting an appropriate word or phrase for some
word or phrase in the original statement so as to make it
true. . . . No credit will be given for false statements
which are corrected merely by the insertion of the word
‘not’.”¹

The proposed method requires roughly double the working 
time of the regular true-false test in the seventh grade, about 
forty per cent more time in high school, and about one-sixth 
more time in college classes. The average scores on the 
modified test are somewhat smaller, and more time is needed 
for scoring. 

Except in one case, the modified form showed from .15 
to .20 higher reliability coefficients. In general, the reliability 
coefficients under the modified form are little higher than 
would have been the case had the same amount of time been 
used for the regular true-false test, i.e., had equal working 
times been compared using the Spearman-Brown prophecy 
formula. This fact tends to discount the value of this 
proposed innovation. In view of the extra labor, the net 
gain of this method would be very small indeed. 
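The comparison at equal working times can be sketched with the Spearman-Brown prophecy formula. The time factors below are those given in the text; the starting reliability `r_regular` is purely hypothetical, chosen only to show the order of magnitude, and is not a figure from McClusky and Curtis.

```python
def spearman_brown(r: float, k: float) -> float:
    """Projected reliability when a test of reliability r is lengthened k-fold."""
    return k * r / (1.0 + (k - 1.0) * r)


# Extra working time required by the modified form, as stated in the text.
time_factors = {
    "seventh grade": 2.0,        # roughly double
    "high school": 1.4,          # about forty per cent more
    "college": 1.0 + 1.0 / 6.0,  # about one-sixth more
}

# HYPOTHETICAL reliability of the regular true-false test (not from the source).
r_regular = 0.60

projected = {level: spearman_brown(r_regular, k) for level, k in time_factors.items()}
for level, r in projected.items():
    print(f"{level}: {r:.3f} (gain {r - r_regular:+.3f})")
```

With this assumed starting value, simply lengthening the regular test to fill the same time raises its reliability by about .15 in the seventh grade, but by much less where the modified form costs little extra time. The comparison therefore depends on the grade level and on the reliability actually obtained.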

Chapter summary. The following statements summarize 
the findings presented in this chapter, although many of 
the statements made must be taken with great caution until
further experimental evidence is at hand:

1. There have been two general approaches to the problem 
of the correction of test scores for chance effects, viz., a 
priori or mathematical discussions based upon theorems of 
probability and experimental studies. The latter are 
probably more trustworthy. 

2. Many criticisms of the true-false test (especially) have 
been based upon a misunderstanding of the significance of
probability. 

3. When there are no omitted items, the score may be
computed either as “number right” or “rights minus
wrongs,” since such scores correlate perfectly (Holzinger’s
proof).

¹ H. Y. McClusky and F. D. Curtis, “A Modified Form of the True-False Test,” Journal
of Educational Research, Vol. XIV (1926), pp. 213-224.

4. Wrong answers are of at least two kinds: (a) answers
guessed at, but incorrect; (b) answers not guessed at, but
answered in good faith, although products of straight mis- 
information. 

5. The binomial theorem will give the mathematical 
expectancies of right and wrong responses, provided that all 
wrong answers are pure guesses. It cannot cover the issue of 
misinformation and consequent wrong responses. 

6. Whether correction for chance lowers or raises the 
reliability is still debatable. Most of the studies point 
toward a lowering, although one of the most recent and 
extensive points toward the opposite effect. 

7. Regardless of slight effects upon reliability, the use of
corrections for chance should be decided upon the run of 
the evidence bearing on the effects of such corrections upon 
validity. 

8. The evidence is almost without disagreement to the 
effect that correction for chance increases the validity of 
test scores, especially when true-false, two-response, and 
three-response tests are concerned. 

9. The teacher who does not wish to trouble to correct 
test scores for chance may avoid this labor by making her 
tests ten to fifteen per cent longer than originally planned
and thus eliminate the need for correction. 

10. The available evidence suggests that both more valid 
and reliable scores are to be obtained by instructing pupils 
to omit items where the answering is nothing more than a 
sheer guess, i.e., by using “do not guess” instructions.
This does not mean that pupils should not attempt items 
about which they have definite “hunches” or “fringes of 
knowledge.” 

11. The two investigations (E. P. Wood and DeGraff
and Ruch) which studied the combined question of instruc-
tions about guessing and corrections for chance suggest that
the best practice is (a) instructions against widespread guess-
ing, combined with (b) corrections for chance.

12. Some argumentative evidence was introduced to the
effect that the true-false test differs considerably in its
psychology from the two-response test proper, in that the
latter suggests more facts which might form the basis for 
the answering of the item. 

13. Thurstone has provided a convenient formula for 
weighting wrong responses when “guess” instructions are 
employed. 

14. The method of Thurstone and the multiple regression 
equation method of Foster and Ruch suggest that the R−W
formula over-corrects from ten to twenty per cent in the 
case of true-false and two-response tests (as applied to the 
data of Brinkley and of Foster and Ruch). 

15. There is some evidence that the general formula

$$S = R - \frac{W}{n - 1}$$

over-penalizes in all multiple-response tests,
but much work is yet to be done on this problem.
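Three of the foregoing statements (3, 5, and 15) are arithmetical and can be checked directly. The sketch below is illustrative only: the function names, the 40 guessed items, and the 100-item test length are assumptions made for the example, not figures from the studies cited.

```python
from fractions import Fraction


def corrected_score(rights: int, wrongs: int, n_choices: int) -> Fraction:
    """S = R - W/(n - 1), the general correction-for-chance formula."""
    return Fraction(rights) - Fraction(wrongs, n_choices - 1)


def expected_guessing(n_guessed: int, n_choices: int) -> tuple[Fraction, Fraction]:
    """Binomial expectations (rights, wrongs) when all n_guessed answers are pure guesses."""
    p = Fraction(1, n_choices)
    return n_guessed * p, n_guessed * (1 - p)


# Statement 5: pure guessing on 40 true-false items (n = 2).
exp_right, exp_wrong = expected_guessing(40, 2)
# Under the formula the expected rights and wrongs from guessing cancel.
net_from_guessing = exp_right - exp_wrong / (2 - 1)

# Statement 3: with no omissions W = N - R, so R - W = 2R - N,
# a straight-line function of R; the two scores rank pupils identically.
N_ITEMS = 100  # hypothetical test length
r_minus_w = [corrected_score(r, N_ITEMS - r, 2) for r in range(N_ITEMS + 1)]
is_linear = all(s == 2 * r - N_ITEMS for r, s in enumerate(r_minus_w))

print(exp_right, exp_wrong, net_from_guessing, is_linear)
```

The cancellation holds only on the proviso of statement 5, that every wrong answer is a pure guess. Wrong answers arising from misinformation are penalized as though they were guesses, which is one way the formula may over-correct.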



CHAPTER XIII


THE NEGATIVE AND OTHER SUGGESTION 
EFFECTS IN THE TRUE-FALSE TEST 

The issue stated. Modern psychology of teaching empha-
sizes the danger of practice in error, i.e., the exercising of
wrong neural connections. In teaching spelling or arithme-
tic, for example, the teacher makes every effort to catch
errors before constant repetition stamps in such wrong 
reactions. It must be admitted that associations are built 
up in exactly the same manner, whether the ideas associated 
are right or wrong. 

Many critics have held the false statement of the true- 
false examination to be dangerous pedagogy in that it tends 
to implant misinformation in the mind of the pupil. Since 
about fifty per cent of the statements in such a test represent 
falsities, it is evident that great opportunity exists for fixing
error in the minds of pupils reading such false statements. 
It is perhaps more than a little curious that such critics have 
never seemed to reckon with the fact that the true state- 
ments might tend to fix truths in the minds of pupils. 

The question at issue has usually been known by the 
name of negative suggestion. 

As has previously been mentioned, some authorities think 
that the stating of true-false items in question form will 
avoid this danger. Thus: 

Is milk white? Yes No 

Do all plants have flowers? Yes No 

So far as the author knows, the believers in the negative 
suggestion effect have discontinued their psychologizing at 
the point where they discovered the possibility of such 
negative learning. 


There is, as a matter of fact, a great deal known in modern
psychology which bears on this issue. For example, there is 
the psychology of the mental set or idea-in-mind. Undoubted- 
ly the danger of negative learning is greatly dependent upon 
the attitude of mind of the pupil taking such a test.

We must grant at once that if a page of spelling words, of 
which approximately fifty per cent were misspelled, were 
placed before a pupil for study, there would be much practice
in error. The sanctity of the textbook would see to that.
On the other hand, the attitude of a pupil toward a true-false
test is not the uncritical, passive, conscious set which we 
have described in the case of the page of spelling; but on the 
contrary, the directions to the test and past experience alike 
cause him to assume a critical, challenging, and thoughtful 
attitude. He knows in advance that about half of the state- 
ments are untruths. His task is to find which are which. 

Moreover, there is some logic in assuming that pupils 
should be exposed to situations calling for discrimination 
between truth and error in exactly the same way that we 
have come to hold that real moral and ethical character
cannot be attained through “running away from sin without
a battle.” 

Pages might be written on both sides of this argument. 
It is better to turn to the meager experimental evidence 
which exists. It is to be admitted that this evidence is 
insufficient in quantity. On the other hand, if the negative 
suggestion effect is one-half or even one-fourth as great as 
some critics suppose, it would bob up alarmingly in even
the crudest experimentation.

Ballard’s investigation. The first study on the suggestion
effects of true-false tests seems to be that of Ballard.¹
He gave true-false tests in geography and history to classes
of thirteen-year-old boys. Later the same questions were
presented in short-answer form and discussed briefly in
class. Still later the same questions were given in recall
form. Some of his results follow:


                    Gains From Initial True-False
                          to Final Recall
Subject               “Trues”          “Falses”

Geography               17%          [illegible]
History (I)              7%              73%
History (II)            14%              64%


¹ P. B. Ballard, The New Examiner (London: Hodder and Stoughton, 1924), pp. 96-98.


From this Ballard concludes (p. 96) “. . . it will be
found, I think, that children will learn more from the true-
false test than from any other type of examination, and it 
is a curious fact that they will learn more from the false 
items than from the true.” 

Ballard does not state how many cases were used, but all
three classes showed uniformly that the gains on the false 
items of the (original) test were much larger than on the 
“trues.” The significance of these results is not altogether 
clear; so we should examine certain further studies before 
attempting to draw conclusions. 

The study of Remmers and Remmers.¹ These authors
chose a selection from the “Customs of the Germans.” On
this passage they built 121 true-false and 121 simple com-
pletion (recall) statements, each statement being made in
both true-false and recall types. Two groups of college 
students were equated by intelligence-test scores. These 
were designated as Groups I and II. On the first day of the 
experiment both groups were given the selection to study 
and were told that they would have a test over it at the next 
meeting of the class. At the next meeting Group I was 
given the true-false test, and Group II the recall test. So 
far as the students knew, this ended the experiment. 

¹ H. H. and E. M. Remmers, “The Negative Suggestion Effect of True-False Examina-
tion Questions,” Journal of Educational Psychology, Vol. XVII (1926), pp. 52-56.

About a month later and without warning the tests were
repeated, except that Group I was now given the recall and 
Group II the true-false. To quote the authors (p. 54), 
“The logic on which we shall base our conclusions is as 
follows: If the average score of Group I be equal to or greater
than that of Group II, it follows that the taking of a true- 
false test can have no deleterious effect upon the formation 
of correct associations.” Table 71 presents their results. 


TABLE 71

Results of Remmers and Remmers Study of the Negative Suggestion
Effect of True-False Tests

                              Averages        Difference    P. E.    Approximate Chances
                         Group I   Group II   of Averages   Diff.    of Significant
                                                                      Differences

True-false                71.16      67.88       3.28        3.35       1 to 1
Recall                   106.18     105.00       1.18        1.44       1 to 1
Sum of T-F and Recall    177.34     172.88       4.46        3.37       1.6 to 1


If Group I, which took the true-false test the next day 
after reading the passage, should have gathered a number of 
false impressions, it should have shown many wrong
answers after the month interval, and consequently a lower
score when the recall test was taken. As a matter of fact,
Group I did slightly better after a month than did
Group II at the outset, although this difference has no
statistical significance. The very slight effect noted was a 
positive rather than a negative suggestion effect. 
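The last column of Table 71 can be approximated from the two columns before it. The sketch below rests on an assumption about how the column was computed, since the text does not say: that the "chances" are the odds against a deviation at least as large as the observed one arising from sampling error alone, in either direction, on a normal curve. The constant 0.6745 is the usual ratio of the probable error to the standard deviation.

```python
import math

PE_PER_SIGMA = 0.6745  # one probable error expressed in standard deviations


def odds_against_chance(difference: float, probable_error: float) -> float:
    """Odds against a chance deviation at least this large (either direction),
    assuming normally distributed sampling errors."""
    z = abs(difference) / probable_error * PE_PER_SIGMA
    p_chance = math.erfc(z / math.sqrt(2.0))  # two-sided tail probability
    return (1.0 - p_chance) / p_chance


# (difference of averages, P.E. of the difference) from Table 71
table_71 = {
    "True-false": (3.28, 3.35),
    "Recall": (1.18, 1.44),
    "Sum of T-F and Recall": (4.46, 3.37),
}

odds = {row: odds_against_chance(d, pe) for row, (d, pe) in table_71.items()}
for row, value in odds.items():
    print(f"{row}: about {value:.1f} to 1")
```

On this reading the odds come out near 1 to 1 for the first row, somewhat under 1 to 1 for the second, and about 1.7 to 1 for the third. That agrees roughly, though not exactly, with the printed column. In every case the odds are far too low to support a claim of a real difference.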

The investigation of Roberts and Ruch. A somewhat
more extensive investigation is that of Roberts and the
author.¹ Two different types of experiments were carried
out by Miss Roberts, as follows:

I. On a given day about half of the students of each of a
number of high-school science classes took a completion
test, and the other half took a true-false test on the “same”
items. Two weeks later all took the completion test (half,
of course, for the second time). These results were studied
by the method of averages as employed by Remmers and
Remmers.

¹ H. M. Roberts and G. M. Ruch, “The Negative Suggestion Effect of True-False
Tests,” Journal of Educational Research (Sept., 1928), pp. 112-116.

The groups were later reversed with a second set of 
materials covering a different topic. 

II. A completion test was followed immediately by a 
true-false test containing the “same” items. At intervals 
from one day to one month, the completion test was ad- 
ministered to all. The two completion tests were then 
checked, item by item, to see the extent to which the second 
answers differed from the first. 

The subjects covered were biology, botany, physics, and 
chemistry. The tests contained from thirty-four to seventy- 
three items. All tests were understood by the pupils to bear 
on their term standing. In all possible respects this investi-
gation was made to be a part of regular classroom routine.

There were some disagreements in the results of the 
experiments of the first type in different classes. The botany 
examinations showed no evidence of negative suggestion 
effects (groups of about thirty students). The chemistry ex- 
aminations (three sections of about twenty each) gave a 
statistically significant, but small, evidence of such imde- 
sirable effects. 

The experiments of the second type are far more signifi- 
cant, as the answers on each individual test item were 
followed in detail from one examination to the next. A total 
of 235 students was used in this series of testings, making 
705 test papers in all. In addition, 128 students took the 
two recall tests but not the true-false as a check on results. 
Table 72 summarizes certain of these results. 

TABLE 72

Differences Between the Average Scores on the Initial Completion
Tests and the Same Tests Repeated at Intervals of From One to
Thirty-Four Days With True-False Tests Covering the “Same” Items
Intervening, and When No True-False Test Intervened

                      Interval   Difference     S. D. of      Difference ÷     Number of
                      in Days    in Averages*   Difference    S. D. (Diff.)    Students

I. Short Intervals

Biology A                 1          8.2           0.6            14.2            21
Chemistry IA              3          4.1           0.8             5.0            20
Chemistry SB              3          1.8           0.4             5.0            26
Chemistry SA'             4          2.1           0.5             4.1            24
Physics B                 5          3.1           0.4             7.3            24
Total                                                                            115

II. Longer Intervals

Chemistry NA             12          2.5        [illegible]        3.7            19
Biology B                15          0.4        [illegible]        0.4            17
Chemistry IB             20          2.2        [illegible]        3.0            21
Chemistry NB             31          3.0        [illegible]        4.2            20
Physics C                33          0.8        [illegible]        1.0            22
Physics A                34         -0.4        [illegible]       -0.5            21
Total                                                                            120

III. Longer Intervals, No Intervening True-False Test

Botany A, Exam. 2        14          1.9        [illegible]        2.3            33
Botany B, Exam. 1        15          3.2        [illegible]        2.9            28
Chemistry A'             14          0.0           0.6             0.1            22
Chemistry A"             14         -1.3           0.8            -1.6            21
Chemistry B              14          1.6           0.6             2.4            24
Total                                                                            128

* Positive values indicate that the average on the second test was higher than on the
first, and vice versa for negative values.


A few statements summarizing Miss Roberts’s results
follow. It is to be noted that conclusions are never based
upon differences less than three times their standard errors,
in accord with common statistical dicta. Some of the fol-
lowing conclusions are based upon Table 72 and some on
data not reproduced here.

1. As a result of the true-false test, roughly one answer in
twenty was changed to the “identical” wrong (the one ex-
posed in the false statements), when chance was allowed for.
This refers to answers originally correct on the initial com-
pletion test.

2. About one answer in seven was changed to an “identi- 
cal” wrong as exposed in the false statements when the 
question was omitted on the original test. 

3. The longer the interval the fewer the answers accepting 
wrong statements, thus suggesting that negative suggestion 
effects are relatively temporary. (The intervals ranged from 
one to thirty-four days.) After a month very few false 
impressions persisted. 

4. The number of changes to “identical” true responses 
outweighed the negative effects. Thus the net effect of the 
true-false was a positive suggestion phenomenon. (The 
differences in this case ranged from four to fourteen times 
their standard errors.) 
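The rule of "three times the standard error" can be applied mechanically to the short-interval rows of Table 72. The sketch below recomputes the ratios from the rounded table entries, so they will not agree to the last decimal with the printed ratios, which were presumably worked from unrounded figures. The names `critical_ratio` and `THRESHOLD` are illustrative.

```python
# (class, difference in averages, S.D. of difference) from Table 72, Part I
short_interval_rows = [
    ("Biology A", 8.2, 0.6),
    ("Chemistry IA", 4.1, 0.8),
    ("Chemistry SB", 1.8, 0.4),
    ("Chemistry SA'", 2.1, 0.5),
    ("Physics B", 3.1, 0.4),
]

THRESHOLD = 3.0  # "three times their standard errors"


def critical_ratio(difference: float, sd_of_difference: float) -> float:
    """Difference expressed as a multiple of its own standard error."""
    return difference / sd_of_difference


ratios = {name: critical_ratio(d, sd) for name, d, sd in short_interval_rows}
meets_rule = {name: ratio >= THRESHOLD for name, ratio in ratios.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} times its S.D. (meets rule: {meets_rule[name]})")
```

Every short-interval class clears the threshold, the recomputed ratios running from about four to about fourteen, in keeping with the range stated in the fourth conclusion.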

Summary. All in all, Miss Roberts's work is in substantial
harmony with the results of Remmers and Remmers. This 
may be the more significant since Miss Roberts worked on 
the high-school level and Remmers and Remmers dealt with 
college students. 

The results of the two studies are alike in that both showed 
the net suggestion effect to be positive and not negative. 
Miss Roberts’s differences were, however, very much more 
significant statistically than were those of the earlier study. 
The principal difference is that Miss Roberts found slightly 
more negative suggestion operating, although five per cent
seems a reasonable allowance. 

Much more work is needed before these conclusions may 
be accepted as final. It is probably fair to suggest that a 
priori arguments, reasonable as they may appear to be, be 
discounted until the advocates of negative suggestion can 
bring forward empirical disproof of these two studies. It 
appears almost certain that the negative suggestion effect
of the true-false test has been over-rated in the minds of
some critics. 

The disparity between experimental results and theoretical 
expectations may lie in the attitude of mind, the mental set, 
assumed by most pupils in taking true-false tests. The 
critical faculties are ordinarily more active in the true-false 
situation than they are in such activities as reading a lesson 
from a text, reciting orally, or writing a discussional exam- 
ination. 

Do pupils tend to respond “True” oftener than “False”
on true-false tests? This question, although not closely 
related to the question of the negative suggestion effect, has 
some general interest for us in these days when the true- 
false test “hangs in the balance.” 

Fritz¹ and others have claimed that pupils mark more
statements “true” than they do “false,” this author report- 
ing the ratio to be about 62 :38. It should be noted that the 
test employed used technical material quite unknown to 
the students. However, almost identical results were 
obtained with materials studied as regular class exercise. 

The author has carried on a number of studies similar to 
that of Fritz. In one such, for 164 students taking an 
examination in educational psychology, the ratio of responses 
given as “true” to those marked “false” was but 52:48. 
There is therefore little agreement between these two 
studies. 

In another unpublished study by two students of the
author (D. D. Durrell and C. L. Cushman), seventy-four
students taking a true-false test in psychology were asked to
indicate whether their responses were (1) matters of absolute
certainty, (2) matters of uncertainty, or (3) pure guessing.
The percentages of errors (based upon the number of at-
tempts) were, respectively, 20.1 per cent, 33.6 per cent, and
48.8 per cent. These students were mature persons and
likely to follow instructions rather closely. The results
indicate that pure guessing, when it occurs, is roughly a
50:50 situation.

¹ M. F. Fritz, “Guessing in a True-False Test,” Journal of Educational Psychology,
Vol. XVIII (1927), pp. 558-561.

Rutledge¹ found that when the “same” items were stated
in both true and false forms, they were equally difficult. 
This finding is not necessarily opposed to the common belief 
and finding that false statements are more difficult than true 
statements. Taking true and false statements as they come, 
it is entirely possible that the maker of the test should cause 
the false statements to be the more difficult. Weidemann, 
Rutledge, and others have shown that false statements tend 
to phrasings that are confusing by the use of double nega- 
tives, “trick” wordings, etc. 

Further comment on these issues is withheld through a
conviction that much more crucial experimentation is 
needed before drawing working conclusions. 

The suggestion effect of position of printed response 
words. Mathews² has recently published evidence that, in
two-response tests, the left-hand response is selected 3.2 per 
cent oftener than is the right-hand alternative, and that 
the upper alternative is chosen 33.8 per cent oftener than 
the lower. 

H. W. Meyer is at present, under the author’s direction, 
repeating Mathews’s work. Meyer’s results will be pub-
lished later.

There is one proposal in Mathews’s conclusions that needs 
comment, viz., that his solution calls for the alternation of 
the response words, true and false, from one item to the
next. This arrangement is illustrated below.

¹ R. E. Rutledge, “The True-False Examination in Elementary Psychology, with Sug-
gestions for its Improvement,” Unpublished Ph.D. Thesis, University of California (1927).

² Journal of Educational Psychology, Vol. XVIII (1927), pp. 445-457.

1. Starch is a carbohydrate.                        True    False

2. Pepsin is an enzyme of the saliva.               False   True

3. The gastric juice is alkaline in reaction.       True    False

4. The gastric juice acts chiefly on proteins.      False   True

Etc. 

Such an arrangement is very likely to be sufficiently con-
fusing to destroy any possible gains from the rotation of
response words. Moreover, the author is unable to see that
Mathews’s finding, if verified, has a great deal of significance.
The argument runs as follows:

1. It is unlikely (and Mathews did not attempt to prove)
that pupils who actually know a response will be misled by
such a fact as mere position of response. (Mathews, in fact,
shows definitely that this tendency to respond to position
increases as the items become more difficult.)

2. Assume that in all pure guesses the upper response is
chosen 33.8 per cent oftener, and the left is selected 3.2 per
cent oftener (or even always). The R−W formula will
take care of the situation, provided the test is constructed so
that right answers occur equally often in upper vs. lower or
left-hand vs. right-hand positions. Response by position is
then equivalent to response by chance. There should be no
net gain or loss by such an extraneous method of responding.
It should be noted that in scoring the Army Alpha tests 
during the late war there were many cases where soldiers 
responded without exception to one or the other of the words 
“true” or “false.” Such papers were arbitrarily marked 
zero on the assumption that R—W scoring would yield 
zero if equal numbers of true and false statements occurred. 
If fewer than 100 per cent were responded to systematically 
by either “true” or “false,” the effect is the same. 
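The argument can be checked with a small sketch. It assumes a hypothetical 100-item test whose key holds exactly fifty true and fifty false statements (the proviso stated above). The function names and the bias value of 0.62, borrowed from the ratio Fritz reported, are chosen only for illustration.

```python
import random


def r_minus_w(responses: list[bool], key: list[bool]) -> int:
    """Rights minus wrongs for a fully answered true-false paper."""
    rights = sum(given == correct for given, correct in zip(responses, key))
    return rights - (len(key) - rights)


def expected_r_minus_w(p_true: float, n_true: int, n_false: int) -> float:
    """Expected R-W for a pupil who guesses 'true' with probability p_true on every item."""
    gain_on_true_items = n_true * (p_true - (1.0 - p_true))
    gain_on_false_items = n_false * ((1.0 - p_true) - p_true)
    return gain_on_true_items + gain_on_false_items


# Balanced key: equal numbers of true and false statements, in shuffled order.
key = [True] * 50 + [False] * 50
random.Random(0).shuffle(key)

score_all_true = r_minus_w([True] * len(key), key)
score_all_false = r_minus_w([False] * len(key), key)

# A guesser biased 62:38 toward "true" still expects nothing on a balanced key.
biased_expectation = expected_r_minus_w(0.62, n_true=50, n_false=50)

print(score_all_true, score_all_false, biased_expectation)
```

The zero depends entirely on the balanced key. On a test with, say, sixty true and forty false statements, a systematic "true" responder would earn a positive R−W score, which is why the proviso about equal frequencies matters.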

Chapter summary. There is little or nothing in the 
discussions of this chapter that dare be taken as proved. 
The following statements are highly tentative:

1. The negative suggestion effect of false statements in
true-false tests is probably much smaller than is sometimes 
assumed. 

2. The small amount of negative suggestion which has 
thus far been shown for true-false tests seems to be fully 
offset by net positive teaching effects. 

3. It is not established that students when in doubt mark 
statements “true” oftener than they do “false,” the evidence 
for this finding being quite inconclusive. 

4. Rutledge claims that the identical items are equally 
difficult when stated as true and as false statements. 

5. It appears that position of printed response words 
affects the answering, especially when the alternatives appear 
one above the other. Upper and left-hand responses are 
more frequently selected. This finding, if true, may have 
little practical significance. 



CHAPTER XIV 


EXAMINATIONS, MARKS, AND 
MARKING SYSTEMS 

Classification of marking systems. There are doubtless 
many hundreds of different marking systems in use in the 
United States if we include all the minor variations of a few 
common types. In a survey of 281 Illinois high schools 
Odell¹ found nearly a hundred different plans in use. In
spite of such diversity, almost all marking systems reduce to 
two general categories when we examine their fundamental 
logic and assumptions, viz., 

1. Systems based upon absolute (and usually subjective) 
standards, the most familiar example being the 100 per cent 
(or 100-point) scale. 

2. Systems based upon relative values, ranks, or the normal 
curve. 

The former are probably more common. Odell found
them to be so in the ratio of about 3 to 1. 

In speaking of the common 100-point² or percentage
scale as an absolute scale, the reference is to the existence of a
standard in the mind of the marker and not to the use of
numbers per se as marks. Numbers or letters are equally
defensible since either must be defined before they can be
held to take on meaning. The only possible advantage of the
use of letters is that this practice may tend to discourage
marking plans which call for more grades or degrees of
achievement than are distinguishable by the human judg-
ment, aided even by the best objective measures which have
yet been devised. The use of a numerical series like 1, 2,
... 5 or 1, 2, ... 7 is fully as good as letter series of 5 or 7
letters since 5 or 7 levels of ability are probably distinguish-
able with a reasonable degree of accuracy. To use the 100-
point scale for such purposes may be objected to as giving
resulting marks an appearance of accuracy which is chiefly
spurious. The objection is precisely that of stating that a
pupil's weight is 78.89 pounds when measurements taken
five or ten hours apart might show variations of a pound or
more. In other words, to attempt to mark one pupil 78
and another 79, while open to no theoretical objection, is to
be severely criticized in the light of experimental findings
to the effect that from three to ten (if as many as ten)
degrees of ability represent the maximum discrimination in
judging human nature.

¹ C. W. Odell, “High School Marking Systems,” School Review, Vol. XXXIII (1925),
pp. 346-354.

² We shall refer to this plan as the “100-point scale” in spite of the fact that there are
really 101 possible marks from zero to one hundred, inclusive.

THE PERCENTAGE GRADING PLAN 

The percentage grading plan analyzed and criticized. 
The strongest argument for the 100-point scale in grading is 
that this practice is rather generally familiar and understood.
In reality the greatest weakness of this system is that its
sheer familiarity leads to an uncritical acceptance without
conscious regard to the inherent assumptions. To employ a
scale of marks beginning with 0 and ending with 100 (101
marks in all), implies either (a) that the user actually thinks
that he is distinguishing 101 levels of pupil accomplishment,
or (b) that he realizes the spuriousness of such an assumption,
but through inertia or callousness to the accepted rules of
science to the effect that results should not be stated with
apparent accuracy greater than the real accuracy, he con- 
tinues to use such a misleading scale of marks. 

One further aspect of the 100-point scale is worthy of 
comment, viz., the use of a passing mark (usually between 
65 and 80, most often at 70 or 75). Such a fixed passing
mark is another proof of the absolutism of such scales. It
takes no account of the very important fact that the number
of pupils reaching, exceeding, or falling short of 70 (or some 
other passing mark) is a function of the difficulty of ex- 
aminations, subjects, etc., or of personal “standards” of 
passing work. These criticisms, again, are based upon 
practical rather than theoretical considerations. 

We may now turn to a somewhat more detailed criticism 
of the percentage (100-point) scale of markings.¹

1. The stated zero and 100 points of the 100-point scale 
are arbitrary, undefined, and inflexible. They are not 
defined in terms of any true scale of educational abilities.
The true zero and the true 100 on a scale of absolute achieve-
ment will often fall long distances above or below such
arbitrary limits, were true units of educational measure-
ment at hand. The only thinkable zero point is that of
“just not any ability” for a given school subject. What a
true “100” means is conjectural. It means, a priori, “per-
fection.” A more rational meaning is probably “as good as
can be expected when all factors of the situation are taken
into account.” This throws the meaning of “100” quite
upon a basis of subjective judgment, and gives it as many
different meanings as there are teachers employing the 100-
point scale.

2. If the zero and 100 points are fixed quantities, all 
intermediate points are, by the same token, fixed, and
presumably the increments between successive integral 
values are equal. Thus, a pupil earning 71 in a school 
subject is exactly as much superior to one earning 70 as is 
one earning 98 to one receiving 97. As every one admits, 
such a strict interpretation is nonsense. Yet such are the 
implications of the use of the scale. 

¹ There are some thinkers who object to branding the 100-point scale as a percentage
scale. A little thought will show that such objections are quite pointless. If we assume
a scale of marks beginning with 0 and ending with 100, it is entirely defensible to brand such
a series “per cents.” They are in effect such from mathematical considerations. The issue
is something like that of the question whether Shakespeare wrote the plays commonly
ascribed to him or whether these were written by another man of the same name.




3. As has been mentioned, the 100-point scale, if taken 
literally, implies 101 distinguishable differences in accom- 
plishment. 

The three foregoing criticisms are based upon a strict 
interpretation of the theoretical implications of the per- 
centage scale. To these we must add two other criticisms 
which follow from the practical considerations in the use 
of the scale. 

4. The use of an arbitrary passing mark (70 or some other 
value) is without adequate defense, and in its actual opera- 
tion results in throwing the distribution of actual marks 
given into a skew distribution quite at variance with the 
probable facts. Judging from a mass of accumulated evi- 
dence, pupils of any school grade distribute themselves 
approximately normally, i.e., in rough accordance with the 
curve of chance, which in turn seems to hold reasonably 
well for most phenomena of pure biological and psychological 
variation. If some mark in the general vicinity of 70 is 
taken as “passing,” the vast majority of pupils will be 
marked between 70 and 100. The average will ordinarily 
fall in the eighties of such a scale. Assume for purposes of 
discussion that the average will fall at or near 85, a reasonable
approximation to the facts. This means that super-average
pupils distribute themselves along a 15-point scale
(between 86 and 100), and sub-average pupils arrange themselves
along an 85-point scale (between 0 and 84). Such a
condition is, to say the least, illogical. As a matter of fact, 
we have been so disturbed by such a lop-sided scheme that 
we have consciously or unconsciously attempted to better 
conditions by the very simple, and quite illogical, expedient 
of ignoring the lower 50 points of the 100-point scale, except 
in rare instances. Examination of marks as actually given
from teacher to teacher and from school to school is the best
possible evidence that the 100-point or percentage scale
does not work in practice.

If we are to grade upon a basis of 0 to 100, there is one and
only one logical average mark, viz., 50. Otherwise we tend 
to force our marks into a skew distribution which is unlikely 
to represent the facts. To find any sure basis for a passing 
mark (which, by our logic, must lie between 0 and 50) is a 
quite impossible task unless we resort to arbitrary definition 
and selection of some point, as, say, 25 or 30. 

There is no intention of arguing that marks should be 
distributed exactly normally. On the other hand, a distri- 
bution like that shown in Figure 12 is much less reasonable 
than one like Figure 13. 



Fig. 12. Showing the skewness and general form of the distributions of
grades usually observed when the 100-point scale is applied.

Fig. 13. Showing a more rational use of the 100-point grading plan, and
one more in harmony with the known facts about individual differences.


5. The fifth objection which we shall mention about the
100-point grading scheme is that a large body of experimental
evidence points to the fact that from but five to
seven levels of ability are ordinarily recognizable by teachers
in marking pupils. This is again a practical objection and
one resting upon empirical evidence. The difference between
an “85” and an “86” is a difference at least five times as
fine as the human judgment can ordinarily distinguish.

Before leaving this treatment of the 100-point or percentage
system, it is well to extend the argument to cover such
systems as the following:

A equals 95 to 100. 

B equals 85 to 94. 

C equals 75 to 84. 

D equals 70 to 74. 

E equals below 70 (failure, condition, etc.) 

In such a case there is no objection to the use of five letters 
as marks. The objection consists in arriving at these letters 
through the use of percentages which imply that the basic 
marking is done on the scale of one hundred. To arrive at 
the letter marks directly, assuming relative achievement, is 
of course a different situation and probably more defensible. 

GRADING BY THE NORMAL CURVE 

The normal curve in marking. The second general type 
of marking system employs the idea of the normal curve or
probability integral. This method is sometimes known as 
the “Missouri Plan” because it was developed by Professor 
Max Meyer of the University of Missouri. In contrast with 
the scale of one hundred, the normal curve idea implies 
marking upon a basis of relative, not absolute, achievement. 
The underlying logic is that of chance or probability. It has 
been noted empirically that chance phenomena like the 
repeated tossing of a number of pennies tend to form a 
symmetrical bell-shaped curve something like that shown in
Figure 13. The essential characteristics of such a curve are: 

1. Most of the cases center rather closely about some 
central value or average.

2. Large deviations from such an average occur less often
than small deviations.

3. Deviations of a given magnitude are equally likely in
either direction from the central value.

4. The expectancy is unlimited at either extreme. In
other words, the curve never, in theory, quite reaches the
base-line but extends indefinitely in both directions. (The
mathematician describes such a curve as asymptotic to the base-line.)

Such curves summarize rather well many phenomena of
biological variation. Since the distribution of mental 
abilities may quite logically be regarded as biological 
phenomena in their essential nature, it is not at all surprising 
that the study of individual differences to date has yielded
many distributions which are sufficiently normal to justify 
statistical treatment upon the assumption of normality. 
At the same time it must never be overlooked that the 
normal distribution need not fit such data at all closely. To 
speak of the distributions of a random group of pupils with 
respect to general intelligence or ability in geography as 
“normal,” really implies nothing more than a conviction 
based upon experience that such abilities appear to be 
distributed in much the same manner as we commonly 
observe to be the case with chance phenomena. Very few, 
if any, mental traits have been found to be distributed in a 
manner markedly at variance with the normal curve.

To return to the four characteristics of the normal curve
as listed above, we find that they, individually and collec- 
tively, summarize pretty well our own experience and 
observation on the distributions of ability as found for the 
various school subjects. At any rate, there is little reason 
to believe that the assignment of marks upon a basis of the 
normal curve will introduce errors which are large compared
with known errors in marking by other common schemes.

The real difference, for present purposes, between the 
100-point scale and the normal curve rests in the fact that
the former assumes absolute standards of judgment, and the
latter attempts nothing more than relative judgments. The 
former attempts the precision of physical measurements with 
their exactly defined units such as inches, hours, meters, 
pounds, etc. The latter attempts merely to arrange pupils 
in general order of merit; i.e., it is essentially a ranking
process. 

Under plans based upon the normal curve there is no 
attempt made to brand a pupil as “80” (or 80%) or as “87” 
(or 87%) with 100 as a point of reference. Instead, pupils 
are marked A, B, C, ... or 1, 2, 3, ... to four, five, six,
or seven (occasionally more) letter or numerical marks. 

Marking schemes are essentially matters of definition. 
It is necessary at this time to interrupt the discussion of 
marking schemes momentarily to bring home certain facts.
After all, any marking scheme is arbitrary. Without defini- 
tion it can have no meaning. If a school system is to develop 
a defensible plan for evaluating pupils’ accomplishment, 
two things are absolutely essential, viz., 

1. The pupils must be placed in correct, relative positions or
ranks with respect to each other; and, 

2. The adopted marking scheme must be defined. Its sole 
meaning and value rest upon its definition to pupils, teachers, 
and parents alike.

With respect to the first of these requirements we need add 
little to the discussions of preceding chapters. This entire 
volume has held constantly in the foreground the need for 
valid and reliable measurement. If an examination, a
series of examinations, or the composite evidence upon which 
marks are determined rank the pupils in the approximately 
correct order of achievement, it matters little what form the 
final statement of marks may take. The rest of the question 
is largely a matter for local decision. No markiTig system 
can be any more valid than is the underlying r anking of the 



EXAMINATIONS AND MARIQNG SYSTEMS 377 

pupils. It is doubtful whether educational measurement 
thus far has risen much above the plane of rank-orders. If 
it has, the standard test alone reaches such a refinement. 
If the teacher has an adequate basis for evaluating her 
pupils in rank-order, almost any marking system, if defined, 
is a good one. If the basic data for marking pupils are 
invalid or unreliable, no marking plan will introduce the 
slightest refinement into these basic determinations. 
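
Once a rank-order is in hand, turning it into marks is a mechanical matter of definition. The following is a minimal sketch in Python, assuming a purely hypothetical scheme of 5, 25, 40, 25, and 5 per cent for the marks A to E and a class of forty pupils already listed from best to poorest. The names `marks_from_ranking` and `SCHEME` are illustrative only and belong to no established procedure.

```python
def marks_from_ranking(ranked_pupils, scheme):
    """Assign marks to pupils who are already listed best-first.

    scheme is a list of (mark, per_cent) pairs, best mark first,
    whose per cents sum to 100. The scheme is a local definition,
    not a fact of nature.
    """
    n = len(ranked_pupils)
    marks = {}
    start = 0
    cumulative_pct = 0
    for mark, pct in scheme:
        cumulative_pct += pct
        # Position in the rank-order where this mark stops.
        end = round(cumulative_pct * n / 100)
        for pupil in ranked_pupils[start:end]:
            marks[pupil] = mark
        start = end
    return marks


# Assumed definition of the marks (hypothetical per cents).
SCHEME = [("A", 5), ("B", 25), ("C", 40), ("D", 25), ("E", 5)]

# Hypothetical class of forty pupils, best pupil first.
ranking = [f"pupil_{i:02d}" for i in range(1, 41)]

marks = marks_from_ranking(ranking, SCHEME)
counts = {m: list(marks.values()).count(m) for m, _ in SCHEME}
print(counts)
```

Changing `SCHEME` changes every pupil's mark without changing a single pupil's rank, which is the sense in which the marking scheme adds nothing to the validity of the underlying ranking.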

This leads us to our second point, viz., that, granted 
approximately correct relative evaluations, the exact marking
system reduces to a pure matter of definition of terms.
An “A” cannot possibly have any other meaning than something
analogous to “that level of achievement which, in a
large and unselected group of pupils, is attained by the best
five (or some other arbitrarily chosen) per cent of pupils.”
Indeed, until we have some exact unit of educational
measure (equivalent to the physicist’s foot-pound or watt- 
hour, etc.), what other meaning can a mark have? It is 
the author’s contention that any teacher who can say, with 
proof, that Henry Jones is the best pupil in the class, Mary
Ellen Brown stands second, Marion Smith is third, and so 
on down to Franklin White who stands last, need not con- 
cern herself greatly whether her pupils are marked with 
letters, numbers, words, or what. The exact and final re- 
cording of the facts at hand will be nothing more or less than 
a definition of practices. The normal-curve idea, the percentage
plan, or any other scheme is fundamentally no
better and no worse than the relative rankings possible by
means of the evidence at hand. For the present, educational
measurement is rank measurement, not measurement in the
physical sense. Almost any scheme of recording marks,
provided it be adhered to by all teachers in the same school
or school system, and provided further that it be understood
by all concerned, will prove adequate if there is a valid and
reliable provision for the measurement of the relative abilities
of the pupils to be graded. At the same time, the definition
of local practices is essential in order that there be
meaning to the final marks given.

It is unfortunate that diversity exists in marking systems 
from city to city and from state to state. A uniform plan
would have many advantages. On the other hand, a uniform
marking plan throughout the United States would have
certain inaccuracies, since we have reason to believe that 
there are variations in the quality of school work from one 
locality to another. 

Normal distributions of school marks. Assuming that the 
better marking systems are those employing relative rather
than absolute standards, of what use can the normal curve 
be? In the first place we should disabuse ourselves of the 
idea that the normal curve (or any other mathematical 
concept) will tell us exactly how many pupils should receive 
A, B, C, etc. The division of a group of pupils among the
several letter or number marks is largely a matter of definition,
but not wholly so, as we shall see. If we assume that
normality of distribution is sufficiently true to the facts of 
human nature, the normal curve does lead us toward certain 
decisions. The fact that observed distributions of mental 
abilities are ordinarily bell-shaped, bunched in the middle, 
and tailed out gradually toward the extremes does furnish a
rough guide to the relative assignments of different marks. 
Thus, to adopt a grading system like the following would 
hardly be defensible in the light of all known facts. 

Letter mark            A     B     C     D     E
Per cents obtaining    20    25    40    10    5

Such a plan runs counter-current to the probable sym- 
metry of the real abilities of a large group of pupils. A visi- 
bly better plan would be: 

Letter mark            A     B     C     D     E
Per cents obtaining    5     25    40    25    5

We are left with a rather difficult problem of deciding
exactly how the various letter or numerical marks should be 
distributed. In the absence of any better basis, students of 
this question have gradually come to think that the pro- 
portions may well follow those to be expected in chance 
phenomena. If we can defend such a position, there is a 
simple mathematical approach to the solution of our problem, 
viz., through the binomial theorem. 

If we toss four pennies for a large number of times, we 
discover that all possible outcomes fall into five categories. 
Moreover, each of the five categories tends to show roughly 
fixed proportions of relative occurrences. We speak of these 
relative proportions as the expectancies or probabilities. 

Calling H heads and T tails, by expanding $(H+T)^4$ we get:

$$(H+T)^4 = H^4 + 4H^3T + 6H^2T^2 + 4HT^3 + T^4$$

The coefficients of the five successive terms (categories) 
are 1, 4, 6, 4, and 1, summing to 16. Expressing the coeffi- 
cient of each term as a fraction of 16, we can build up a 
table like the following: 

Category                   Probability of Occurrence

Four heads, no tails       1 in 16   or about   6%
Three heads, one tail      4 in 16   or about  25%
Two heads, two tails       6 in 16   or about  38%
One head, three tails      4 in 16   or about  25%
No heads, four tails       1 in 16   or about   6%

If we should plot these expected results as a graph, it
would be noted that the results run more or less parallel to
the characteristics of the normal curve, i.e., a heaped symmetrical
distribution which tails out gradually toward either
end. Such a distribution is not normal for a number of
reasons. However, if the interested student will expand the
binomial $(H+T)^n$ for successive values of $n$, plotting each
expansion as a curve, he will probably be convinced that the
limit of the binomial, when $n$ becomes infinite, is the curve
which we have discussed under the name of the normal
curve or probability integral. 
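
The reader who prefers computing to plotting may check both claims with a few lines of Python. This is a rough sketch under stated assumptions: the pennies are taken to be fair, the normal curve used for comparison is given the same mean (n/2) and standard deviation (half the square root of n) as the tosses, and the function names are illustrative only.

```python
from math import comb, erf, sqrt


def binomial_expectancies(n):
    """Chances of n, n-1, ..., 0 heads when n fair pennies are tossed."""
    total = 2 ** n  # sum of the coefficients of (H+T)^n
    return [comb(n, k) / total for k in range(n, -1, -1)]


def normal_cdf(x, mean, sd):
    """Area of the normal curve lying below x."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))


def normal_expectancies(n):
    """Normal-curve area over the unit interval centered on each count of heads."""
    mean, sd = n / 2, sqrt(n) / 2
    return [
        normal_cdf(k + 0.5, mean, sd) - normal_cdf(k - 0.5, mean, sd)
        for k in range(n, -1, -1)
    ]


def largest_gap(n):
    """Largest disagreement between the binomial and the normal expectancies."""
    pairs = zip(binomial_expectancies(n), normal_expectancies(n))
    return max(abs(b - a) for b, a in pairs)


four_pennies = binomial_expectancies(4)  # 1/16, 4/16, 6/16, 4/16, 1/16
print(four_pennies)

for n in (4, 16, 64, 100):
    print(n, largest_gap(n))
```

The gap printed for each n shrinks as n grows, which is the numerical counterpart of the statement that the normal curve is the limit of the binomial.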

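Below are given the expansions of the expression $(H+T)^n$
for values of $n$ between 2 and 6. Below each term is given
the coefficient of that term expressed as a fraction of the
sum of the coefficients of all terms.

$$(H+T)^2 = \underset{1/4}{H^2} + \underset{2/4}{2HT} + \underset{1/4}{T^2}$$

$$(H+T)^3 = \underset{1/8}{H^3} + \underset{3/8}{3H^2T} + \underset{3/8}{3HT^2} + \underset{1/8}{T^3}$$

$$(H+T)^4 = \underset{1/16}{H^4} + \underset{4/16}{4H^3T} + \underset{6/16}{6H^2T^2} + \underset{4/16}{4HT^3} + \underset{1/16}{T^4}$$

$$(H+T)^5 = \underset{1/32}{H^5} + \underset{5/32}{5H^4T} + \underset{10/32}{10H^3T^2} + \underset{10/32}{10H^2T^3} + \underset{5/32}{5HT^4} + \underset{1/32}{T^5}$$

$$(H+T)^6 = \underset{1/64}{H^6} + \underset{6/64}{6H^5T} + \underset{15/64}{15H^4T^2} + \underset{20/64}{20H^3T^3} + \underset{15/64}{15H^2T^4} + \underset{6/64}{6HT^5} + \underset{1/64}{T^6}$$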

If we collect these results as a table, changing the stated 
fractions to per cents and denoting the categories by the 
letters A, B, C, etc., we have the following as the approxi- 
mate expectancies of the occurrences of each letter-grade 
upon the assumption of chance distribution of pupil abilities : 


No. of Letter                    Letter Grades
Grades Employed     A     B     C     D     E     F     G

       3           25%   50%   25%
       4           12%   38%   38%   12%
       5            6%   25%   38%   25%    6%
       6            3%   16%   31%   31%   16%    3%
       7            2%    9%   23%   31%   23%    9%    2%

The per cents in the foregoing tabulation have been rounded
off in such a fashion as to keep the sum in each series one
hundred per cent. Note that if n equals the number of
letter-grades to be employed, the expectancies are found
from expanding the binomial to n − 1 as the exponent.
That such distributions are roughly the normal expectancies
as well can be determined by consulting any table of the
normal curve or probability integral. Any textbook on
statistical methods gives such tables. 
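
The rule just stated (expand the binomial to the exponent n − 1 when n letter-grades are wanted) is easily put into a short Python sketch. The function name `grade_expectancies` is an illustrative assumption, and the figures are left unrounded so that the reader may see how much rounding the tabulated per cents involve.

```python
from math import comb


def grade_expectancies(num_grades):
    """Per cent of pupils expected at each letter-grade, best grade first.

    Uses the coefficients of (H+T) raised to the power num_grades - 1,
    each expressed as a percentage of the sum of the coefficients.
    """
    exponent = num_grades - 1
    total = 2 ** exponent
    return [100 * comb(exponent, k) / total for k in range(exponent + 1)]


for num_grades in range(3, 8):
    letters = "ABCDEFG"[:num_grades]
    row = "  ".join(
        f"{letter} {pct:5.2f}%"
        for letter, pct in zip(letters, grade_expectancies(num_grades))
    )
    print(f"{num_grades} grades: {row}")
```

For five letter-grades the sketch gives 6.25, 25, 37.5, 25, and 6.25 per cent, which round to the 6, 25, 38, 25, and 6 of the tabulation.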

Limitations of normal distributions of marks with small 
classes. Few persons will find any serious quarrel with the 
above distributions of marks as a matter of pure theory, 
since we have attempted nothing more than a definition of 
terms upon the assumption that pupils distribute themselves 
according to chance, and that, if this be the case, theorems 
of probability furnish rough guides to a defensible marking
plan. The derivation of these relative proportions is based
upon one very important additional assumption, viz., that
n, here the number of pupils, is very large. This means, in effect,
that such distributions
of markings may be held to be fairly valid for large numbers
of pupils, but that they must not be applied too literally to
small groups. In general, it is unwise to apply such systems
in a purely mechanical fashion if fewer than one hundred
pupils are to be marked as a unit. Even with a few hundreds
of cases, some injustice is likely at times, although such
errors are probably very slight in comparison with numerous
other non-eliminable and ever-present errors in grading
such as subjectivity, unreliability of examinations, etc.

It is generally known that the average accomplishment of
typical classes of ten to fifty pupils varies somewhat from year
to year. Occasionally very large variations will occur, although
these are the exceptions, not the rule. Teachers
generally are prone to over-estimate the amount of variation 
to be expected in the average abilities of successive classes. 
To illustrate the probable amounts of variation from one 
class to the next, we may assume certain facts: 

1. That each class contains thirty-six pupils. 

2. That the same objective and rather reliable test was
given to each successive class.

3. That the average score of the first class on the test mentioned
was 90 out of a possible 150 points.

4. That the standard deviation of the marks was approxi- 
mately 30, and the probable error was approximately 20. 
(See Part IV for the meaning of these terms.) For present 
purposes the meaning of the probable error will be sufficiently 
defined if we assume that within 20 points on either side 
of the average (90), i.e., between 70 and 110, roughly fifty 
per cent of the scores fall. 

Our problem is to decide within what limits the averages 
of successive classes will probably fall. To do this we need 
what is called the probable error of the average. The usual 
formula is: 

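$$PE_{(\text{Average})} = \frac{PE}{\sqrt{N}}$$

in which N is the number of pupils in the class.

Substituting the assumed values, we have:

$$PE_{(\text{Average})} = \frac{20}{\sqrt{36}} = \frac{20}{6} = 3.3$$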

Upon the basis of our calculation we may conclude that 
the chances are 50:50 that next year’s class (of thirty-six 
pupils) will show an average score on the same test within
the limits of plus or minus 1 PE, i.e., between 86.7 (90 − 3.3)
and 93.3 (90 + 3.3). Without troubling to explain the
mathematics of the situation, we can say: 

The chances are 50 in 100 that the average of any succeeding
class will fall between 86.7 and 93.3.

The chances are 82 in 100 that the average of any succeeding
class will fall between 83.4 and 96.6.

The chances are 98 in 100 that the average of any suc- 
ceeding class will fall between 80.1 and 99.9. 
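
For the reader who wishes to see where such figures come from, the following Python sketch computes the probable error of the average and the normal-curve chances of falling within one, two, and three probable errors. It assumes strict normality and the usual relation that one probable error is about 0.6745 standard deviations; the function names are illustrative only. Under these assumptions the third chance comes out nearer 96 than 98 in 100, so the last of the three statements is perhaps best read as a round figure.

```python
from math import erf, sqrt

PE_PER_SD = 0.6745  # one probable error is about 0.6745 standard deviations


def probable_error_of_average(pe_of_scores, class_size):
    """Probable error of a class average: PE divided by the square root of N."""
    return pe_of_scores / sqrt(class_size)


def chances_within(multiple_of_pe):
    """Chances in 100, on the normal curve, of falling within +/- this many PEs."""
    z = multiple_of_pe * PE_PER_SD
    return 100 * erf(z / sqrt(2.0))


# Assumed values from the illustration: PE of scores 20, class of 36, average 90.
pe_average = probable_error_of_average(20, 36)

for multiple in (1, 2, 3):
    low = 90 - multiple * pe_average
    high = 90 + multiple * pe_average
    print(
        f"within {multiple} PE: {low:.1f} to {high:.1f}, "
        f"about {chances_within(multiple):.0f} chances in 100"
    )
```

The limits differ by a tenth of a point or so from those quoted in the illustration because the sketch carries the probable error as 3.33 rather than 3.3.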

There is thus but about one chance in fifty that the average 
of a succeeding class will depart by as much as ten points 
in either direction from the average of the first class. It may
seem that ten points is a large departure. Just how large it
is may now be estimated. We started with the assumption 
that the middle fifty per cent of the class scored between 
70 and 110, a range of forty points. In such a case the total 
range of scores of the individual pupils of the class would 
hardly be less than 100 points. The average from class to 
class would fall roughly ninety-eight times out of one hun- 
dred between 80 and 100, or within a range of twenty points, 
which is about one-fifth of the difference between the best 
and poorest pupil in the class. About four times out of five
(82 in 100) the average, from class to class, would fall
between 83 and 97 (83.4 and 96.6 to be more exact), a range 
of fourteen points, or about one-seventh of the difference 
between the highest and lowest score in a given class. 

Without trying to minimize the seriousness of such fluctuations
in averages, it is necessary to call attention to the fact
that, after all, when the final marks of the class are distrib- 
uted upon a five- or seven-letter grade basis, it is easy to 
exaggerate the seriousness of variations from class to class. 
We should attempt some method of making allowance for 
differences in the average achievements of small classes. 
This question will be the next issue to be discussed here. 

Proposals for allowing for variations arising from small 
samplings of pupils. A given class of twenty or thirty pupils 
may well be regarded as a small sampling of the pupils of a 
number of successive classes. Many teachers have objected 
violently to the use of the normal curve in grading small 
classes, upon the argument that such marking is too mechanical
and arbitrary. Such teachers wish, with much justification,
to use their judgment in departing from the adopted
proportions of the letter-grades agreed upon as the basis of
marking in that particular school. With such a desire the
author is in considerable accord. Nevertheless there are a
number of considerations which must be kept in mind.

1. In the absence of demonstrated departures of a given
class from typical conditions, the teacher should be very 
cautious about departing far from the adopted scheme. 
Human judgments are so fallible that small variations can- 
not be detected unless rather exact and objective measure- 
ments are employed. Large variations will ordinarily be 
open to casual inspection; small variations will not. It is 
probably a safe rule to adhere to the adopted plan unless the 
judgment of the teacher is supported by facts obtained from 
the use of well-constructed tests, standardized measures, or 
intelligence testing, etc. 

2. While variations in classes occur from year to year, it is 
unlikely that these occur constantly in the same direction. 
Over a period of years the pooled marks should approximate 
fairly closely the adopted scheme, even when certain classes 
are allowed to depart noticeably from the plan in use. 
More times than not, teachers fall into the habit of “always 
having superior (or, less often, inferior) classes”; i.e., each 
year they diverge in the same direction from the adopted 
plan. This phenomenon may be a subtle form of self-flattery 
in many instances. The teacher who keeps no cumulative 
records will nine times out of ten either (a) use the grading 
system in an arbitrary and mechanical fashion, or (b) show
systematic and continual departures from the adopted 
scheme. Neither of these outcomes can be defended. 

3. The dangers from a too-rigid adherence to any grading
plan with small classes are undoubtedly real; at the same time
variations in classes may be rather insignificant in the light
of other sources of error in marking. In particular it is to 
be noted that no grading plan will make matters markedly 
better or worse than the inaccuracies present in the bases by 
which the marks are decided, whether these be recitations, 
written work, examinations, standard tests, or what. If 
there is an adequate basis for marking pupils so that they 
are arranged fairly reliably in order of achievement, almost 
any grading system will work reasonably well.

4. The fundamental issue is to provide objective, valid,
and reliable instruments for determining relative ability. 
This is essentially a matter of ranking pupils in order of 
relative merit. Secondarily there is the problem of changing 
these rankings into some literal or numerical system of 
marks. This second step cannot introduce any refinement 
into the first and more fundamental consideration. Then we 
can attempt, more or less successfully, the third problem of 
allowing for the effects of small classes. 

5. Any grading plan is largely a matter of definition. It 
is a “gentlemen’s agreement” to play the game by the same 
rules. A teacher who, year after year, departs systematically 
from this agreement is a “poor sport”; she may be justly 
accused of willfully refusing to co-operate in defining marks
in the minds of pupils and parents. To excuse the practice
on such grounds as having “high” standards (these teachers
never apparently have “low” standards), or that she gets
better results from her pupils than do other teachers in the
same school, only puts the teacher in a bad light, no matter 
what the facts may be. 

We may now consider briefly certain proposals for allowing 
for variations in average abilities of small classes from one
year to the next or from one subject to the next. 

First Proposal. In stating the approximate percentages 
to be given each letter grade, allow some latitude for the 
teacher’s final judgment. Thus: 

    A          B          C          D          E
  5%-10%    20%-30%    35%-45%    20%-30%    5%-10%

Although some latitude is essential in order to avoid 
slavish adherence to the scheme, there is always the danger 
that teachers will abuse the privilege and depart systemati- 
cally in one direction or the other. There is the further 
objection that the mere allowance of latitude does nothing 
at all toward providing some defensible basis of setting the 
scale to fit particular classes.

Second Proposal. Check the direction and amount of
variation or selection present through the administration
of an intelligence test. This is, in part, the recent proposal
of Ellis,¹ although many earlier advocates of this idea are
to be found. Although this method has undoubted value,
it is open to several objections: (a) the extra cost, (b) the
extra labor, and (c) most important, intelligence is often 
a poor index to success in certain school subjects, not to 
mention the possibility of classes working well above or 
below the indications of their intelligence ratings. It is 
unlikely that teachers generally will be convinced that 
Ellis's plan is the best solution in sight, although this plan 
has its merits. 

Third Proposal. Administer a good standard test as a 
check on your assignment of letter grades. Symonds² favors
this plan, and he has set up the needed machinery for handling
the method with a number of high-school tests.³ Symonds’s
plan is probably much to be preferred over that of
Ellis, as it represents a direct attack on the problem, and he 
has worked out convenient tables for handling many high- 
school test results. No such data are available for elemen- 
tary-school subjects. Moreover, it is only fair to state that 
Symonds's work is not directly aimed at our present pro- 
posal since he is primarily concerned with a method of 
transmuting standard test scores into grading units. His 
tables do furnish valuable checks upon such questions as the 
variation and selection present in a small class. 

Fourth Proposal. The author has reached a somewhat 
different solution of the question of allowing for variation 
and selection in marking successive classes or different sec- 
tions of the same class. The real problem is that of throwing 


¹ R. S. Ellis, Standardizing Teachers’ Examinations and the Distribution of Class Marks
(Bloomington, Illinois: Public School Publishing Company, 1927), pp. 11^114.

² P. M. Symonds, Measurement in Secondary Education (New York: The Macmillan
Company, 1927), pp. 507-529.

³ P. M. Symonds, Ability Standards for Standardized Achievement Tests in the High
School (New York: Teachers College, Columbia University, 1927).

successive (and possibly varying) classes or sections upon a
single scale so that the direction and amount of departure of 
any class or section from the central tendency of many such 
groups may be noted. This, in the author’s opinion, may 
be done with a degree of accuracy commensurate with the 
needs of the situation through the construction and use of 
duplicate or equivalent examinations. The building of two 
or more forms of the same examination so as to capitalize
on the equalizing force of chance was described in Part II 
of this volume. In brief, the plan is to prepare hundreds of 
items and then to deal them by chance into from two to five 
piles, depending upon the number of duplicate forms ad- 
judged necessary. If each resulting form contains at least 
one hundred items, the averages of such chance forms will
ordinarily differ by not more than from two to five score 
points, an amount of inequality of no great practical conse- 
quence in view of the other unavoidable errors present in 
any marking scheme. 
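
The dealing of items into duplicate forms may be pictured with a short Python sketch. Everything in it is assumed for illustration: a pool of 300 hypothetical items, each carrying an invented "difficulty" (the proportion of pupils expected to answer it correctly), and the function name `deal_into_forms`. The sum of the difficulties on a form serves as a rough stand-in for the average score that form would yield.

```python
import random


def deal_into_forms(items, num_forms, seed=None):
    """Shuffle the items and deal them, card-fashion, into num_forms piles."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Pile i receives every num_forms-th item, starting with item i.
    return [shuffled[i::num_forms] for i in range(num_forms)]


# Hypothetical item pool: (item name, assumed proportion answering correctly).
pool_rng = random.Random(0)
pool = [(f"item_{i:03d}", pool_rng.uniform(0.2, 0.9)) for i in range(300)]

forms = deal_into_forms(pool, num_forms=3, seed=1)

for number, form in enumerate(forms, start=1):
    expected_average = sum(difficulty for _, difficulty in form)
    print(f"Form {number}: {len(form)} items, expected average {expected_average:.1f}")
```

Because chance alone decides which pile an item falls into, no form is favored with the easier items, and the expected averages of the forms should lie within a few points of one another. The exact spread depends upon the pool and upon the particular deal.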

If there are several sections of the same class, different 
tests or examinations may be given to each, and yet the 
resulting scores may be thrown into a single distribution for 
purposes of determining marks. Even if the classes are 
sectioned upon a basis of ability, the duplicate examinations 
may be planned so that the pupils from all sections are
placed upon a single scale of measurement. More will be
said later on this point.

When there is but a single section a semester or a year, the 
present proposal offers no assistance at the outset. Over a 
period of two or three years the effect is to generate a set of 
local norms. As time goes on, these norms become more and 
more accurate if cumulative records are kept. The proposals 
of Ellis and Symonds, particularly the latter, may at the 
outset be used to advantage, but will gradually become 
unnecessary. Our proposal reduces to the very simple
proposition of avoiding the pitfalls inherent in the grading
of small classes by pooling successive sections or classes
until the resulting numbers are large enough to be viewed 
as stable samples. The principal advantage of this proposal 
is that it is cheaper and less laborious than the plans advo- 
cated by Ellis or Symonds. 
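
A minimal sketch of such pooling follows, again in Python and again under invented assumptions: three successive classes of twenty pupils each, hypothetical scores on equivalent forms, and a 5-25-40-25-5 definition of the marks. The names `pooled_cutoffs` and `mark_for` are illustrative only. The pooled scores yield a set of local norms, here expressed as the lowest score earning each mark.

```python
def pooled_cutoffs(class_scores, scheme):
    """Pool the scores of successive classes; return the lowest score for each mark.

    scheme is a list of (mark, per_cent) pairs, best mark first, summing to 100.
    """
    pooled = sorted((s for scores in class_scores for s in scores), reverse=True)
    n = len(pooled)
    cutoffs = {}
    start = 0
    cumulative_pct = 0
    for mark, pct in scheme:
        cumulative_pct += pct
        end = round(cumulative_pct * n / 100)
        if end > start:
            cutoffs[mark] = pooled[end - 1]  # poorest score still earning this mark
        start = end
    return cutoffs


def mark_for(score, cutoffs):
    """Mark a single score against the local norms (cutoffs are listed best-first)."""
    for mark, lowest in cutoffs.items():
        if score >= lowest:
            return mark
    return list(cutoffs)[-1]  # below every cutoff: the lowest mark


SCHEME = [("A", 5), ("B", 25), ("C", 40), ("D", 25), ("E", 5)]

# Hypothetical scores of three successive classes on equivalent forms.
year_1 = list(range(60, 120, 3))
year_2 = list(range(55, 115, 3))
year_3 = list(range(70, 130, 3))

cutoffs = pooled_cutoffs([year_1, year_2, year_3], SCHEME)
print(cutoffs)
print(mark_for(110, cutoffs))
```

As further classes are added to the pool the cutoffs settle down, and a new class may be marked against them whether it happens to be a strong class or a weak one.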

Before closing the section on the values of duplicate forms, 
we must anticipate a possible misunderstanding. The reader 
may say that, after all, his or her regular practice of building 
a new test each semester is exactly equivalent to the present 
proposals. Not at all! It is to be hoped that the author has 
been sufficiently lucid to dispel such a misunderstanding.
The construction of two, three, or more forms of an exam- 
ination at one time is essentially different from the production 
of the same number of “forms” at different times. Simul- 
taneous duplication capitalizes on the equalizing force of 
chance (or chance plus judgment); successive duplication 
with anything like approximate equality is an almost impossible
task unless extended experimentation is resorted to.

The problem of allowing for the variation and selection 
inherent in small classes is aggravated by the practice of 
sectioning classes upon a basis of ability. The next section 
comments upon the marking of such sections. 

The marking of classes sectioned upon a basis of ability. 

It is now common practice to divide large groups of pupils
into ability sections. These are often designated as “X,” 
“Y,” and “Z” groups. The basis for such sectioning is 
most often the scores from mental or educational tests. 
When sectioning upon a basis of ability is used, the sections 
are likely to present the situation shown in Figure 14. 

Figure 14 shows two significant facts: (a) that the average 
abilities of the three groups are quite different, and (b) that 
there is much overlapping of the abilities of the three groups. 
In view of these facts, we must make some allowances in 
our grading scheme for such differences in ability.

Practices are sharply divided upon the issue of marking
superior, normal, and sub-average sections of the same 
subject. Two general plans are used. 



Fig. 14. Showing the effects of sectioning a group of pupils into three
ability groups.


Fig. 15. Showing the effect of marking classes sectioned upon a basis of
ability when a single standard of marks is employed. The upper
distribution shows the marks of the total group (the three sections
pooled) and the lower distribution shows the marks for the X, Y, and
Z sections separately.


Plan I. Mark all sections upon a single basis. This 
means that the X (superior) section will receive a larger 
proportion of high marks than will the Y (average or normal)
section, and, in turn, the average section will receive a larger 
number of high marks than will the Z (sub-average) section. 
Figure 15 shows roughly the effect of this plan. The upper
distribution shows the marks which might be given to the 
total group upon a basis of five letter-grades. The lower
distribution shows how the marks might be apportioned after
the total group has been broken into three sub-groups on a 
basis of ability. In drawing these curves it is assumed that 
the sectioning reduces the abilities of each section to a range 
of three letter-grades (in contrast with a range of five letter- 
marks for the total group). 

Plan II. Mark each section on a different basis; i.e., make 
the marks for each section show merely relative accomplish- 
ment within that section, thus ignoring the fact that the 
three sections represent different levels of ability. In this 
case all sections would receive equal numbers of each letter- 
grade. Some schools follow such a plan, but designate the 
letter-grades with subscripts to indicate the section in which 
the mark was earned thus: Ax, Ay, Az, or By, Bz, etc.
It is obvious that an A or a B does not have the same mean- 
ing in a Z section as it does in an X section. There are just 
as many standards of marking under Plan II as there are 
ability groups or sections. It should be noted that this plan 
requires (logically) that there be as many failures among 
the brightest pupils as there are among the dullest ones. 

The choice between Plans I and II is an arbitrary one. 
There is no doubt that the first plan is superior from the 
standpoint of the theory of measurement. It represents 
measurement by a single, fixed, and defined standard in 
something of the sense of measurement in the physical 
sciences. On the other hand, Plan II offers important
educational advantages in that it is less likely to arouse the 
conviction (in Y and Z sections) that high marks are almost, 
if not quite, impossible. Many teachers feel that a pupil 
doing good work, relatively speaking, in a slow (Z) section 
should be allowed to earn an A or a B exactly as if he were a 
member of a fast (X) group. There is something to be said 
in favor of such a view from the standpoint of motivation, 
but there are certain weaknesses in this plan that should be 
considered. In the first place, the more able students sometimes
become aware of the facts and object to a situation
which makes them do, perhaps, twice as much work to earn
an A as is the case with a pupil in a Z section. In some
instances students have been known to “lie down” in taking 
the tests used for sectioning in order that they may earn a 
berth in a slower-moving section. Again, schools need at 
times to make further evaluations of pupils, e.g., in recom- 
mending high-school graduates to college registrars or to 
prospective employers. Thus far many schools employing 
Plan II have not yet shown the courage of their convictions 
by granting differentiated diplomas or certificates of com- 
pletion. In the State of California only A and B grades 
in high school are “recommending grades” to first-class 
colleges and universities. Such A’s and B’s, of course, can- 
not be earned in Y or Z sections. Under the California plan 
it is inevitable that high-school pupils will become aware of 
the significance of ability grouping. Viewed from any angle 
it seems impossible to employ Plan II without ultimately 
“laying the cards on the table” so that an A or a B may be 
construed by all concerned as having meaning only with 
reference to the section (ability grouping) in which it was 
earned. Pupil, parent, teacher, and employer will have to 
know the facts. When this condition exists, Plan II will
lose much of the advantage claimed for it by way of moti- 
vation. 

Plan II suffers in comparison with Plan I in that it requires 
separate examinations for each section. Under Plan I 
uniform examinations may be set for all three (or other
number of) sections. In fact, uniform examinations for all 
ability groups are essential if all pupils are to be marked upon 
a single scale. Such tests or examinations must be constructed
so as to include the basic contents of the instruction
in all three sections. Examinations of this nature are quite
practicable. It is only necessary to include, as the first few 
items of such an examination, questions which will be passed
by all slow or Z section pupils, and then to proceed by degrees
to increase the difficulties up to a final point where the items 
are failed by most or all of the fast or X pupils. Some teachers 
will object to this idea without realizing that this procedure 
is exactly what occurs in the case of any standardized mental
or educational test. It is true that the Z sections will be 
exposed to test materials which have never been taught. 
This is not a fundamental objection in comparison with the 
obvious advantage of having all sections evaluated upon a 
single scale of accomplishment. In no other way is it possible 
to throw X, Y, and Z groups into direct comparison, one 
against the other. 

MEASUREMENT AND RANKING 

Measurement as rankings. Before any norm need be consulted,
before any interpretation need be sought, and before
any sort of evaluation is justified, we must first establish the 
fact that the test employed is capable of arranging the pupils of 
a class in a valid and reliable rank-order of ability. 

Any test, no matter how worthless, yields a series of scores 
for the members of a group of individuals. The fact that a 
wide range of scores is obtained is quite without meaning 
unless certain facts are first established. Very pretty sets of 
scores for a class of pupils may be obtained by “shooting” 
dice or tossing pennies. Giving a test is something like a 
horse race. Some horse always wins, and the rest come in 
second, third, fourth, etc., until the last horse crosses under 
the wire. The question is: “Did the best horse win?” 

A good test arranges the pupils in the approximate rank- 
order of ability. There is little error in the ranking. A 
worthless test (zero reliability) arranges them in an order no
better and no worse than a lottery which is “on the level.”

We thus have a picture of the two extremes of the scale 
of reliability, a perfectly reliable test and a worthless or 
entirely unreliable one. The former (if such existed) would
yield a set of scores for a class such as would justify each
and every score being taken at face value. If the highest 
score were 89, the next 86, the next 77, and so on, we could 
say without danger of error that the pupil obtaining 89 was 
best, the 86-pupil was next in ability, the 77-pupil was third,
and so on; all this assuming that the test was perfectly 
reliable. Suppose on the contrary that the test was abso- 
lutely unreliable (zero in reliability). Any set of scores 
obtained would be no more accurate than having the pupils 
draw the same numbers from a hat. 

In actual practice, tests are seldom near zero in reliability, 
and they usually are far from perfect in reliability. It 
follows that all test scores contain errors in greater or less 
degree, or as test-workers say, the test is fallible. A fallible 
test is the only kind the teacher can ever hope to administer. 
It follows that every test score must be taken cum grano
salis. If we arrange the pupils of a class in rank-order upon
the basis of any fallible test, it is certain that some pupils 
will be interchanged in rank. The best that we can hope 
for is that the displacements will not be very large in any 
single case, and that the average displacement will be as 
small as possible. 

Consider now a possible situation. It is near the end of a 
school term. The teacher is beginning to think about mak- 
ing a final examination, the results from which will enter 
importantly into the final grades. She decides upon a certain 
kind of test, old-type or new-type; it matters little. The 
questions are phrased, written down, and administered. The 
test is scored, and a wide range of scores is obtained. What 
do the scores mean? 

Before attempting to suggest an answer to this question, 
let us propose an innovation to the examination practice 
for this year. Suppose the teacher should decide to build a 
second test equivalent to the first so far as she is able to 
judge. To verify the results of the first test, she gives the
second one which she has constructed. The results are
scored and tabulated, pupil by pupil, alongside the first 
set of scores. They disagree! In places the differences are 
disturbingly large. At other times they are too slight for 
serious attention. 

The teacher now calculates the averages on the two examinations.
One shows an average of 80 and the other of 71,
a difference of nine points.¹ Also, the range of scores in the
first test was from 72 to 95, and in the second from 56 to 91. 
Thus, both the average and the variability (range) differed 
in the two instances. 

What would such a result mean? 

It is apparent that the numerical scores are seriously upset 
by the two facts already mentioned: the differences in 
averages and the differences in variability. Would the
ranks be as markedly disturbed by such facts? Not neces- 
sarily so. In fact, a few minutes’ thought will probably 
convince the reader that rank-orders are less subject to 
inequalities from one equally valid but unequally difficult
examination to the next than are the absolute numerical
scores. If this be true, there is an important lesson for us 
here in evaluating pupils' achievement. 
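
The point can be sketched in a few lines of Python. The five scores below are hypothetical and serve only as an illustration; the second examination is assumed to order the pupils exactly as the first does, while being harder and more variable (the ranges 72 to 95 and 56 to 91 are those of the illustration above). Every numerical score changes, yet the rank-order does not.

```python
# Hypothetical scores for five pupils on a first examination (illustration only).
first = {"P1": 95, "P2": 88, "P3": 80, "P4": 76, "P5": 72}


def rescale(score, old_low=72, old_high=95, new_low=56, new_high=91):
    """Map a score from the first test's range onto a harder, wider range."""
    return new_low + (score - old_low) * (new_high - new_low) / (old_high - old_low)


def rank_order(scores):
    """Return pupil names from the highest score to the lowest."""
    return sorted(scores, key=scores.get, reverse=True)


# The same pupils on an equally valid but unequally difficult second test.
second = {pupil: rescale(score) for pupil, score in first.items()}

# Pupils whose numerical score differs between the two tests.
changed = [pupil for pupil in first if first[pupil] != second[pupil]]

print(rank_order(first))
print(rank_order(second))
print(len(changed), "of", len(first), "numerical scores changed")
```

A fixed passing mark of 70 would fail no one on the first set of scores and several pupils on the second, although the order of merit is identical.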

An actual example. We can now turn our attention to an
actual example. Two examinations were given. So far as 
inspection could show, they were equally difficult. One was 
given on Monday, the other on the following Wednesday. 
Each examination contained 100 objective items. Table 73 
shows the results, together with certain computations. 

Discussion of Table 73. The conditions of the experiment
must be kept in mind as we study this table. In the first 
place, this experiment differed from the ordinary classroom 

¹This figure is chosen because it represents about the average difference found in a series
of similar double-examinations studied experimentally by McGregor and Ruch. See
Objective Examination Methods in the Social Studies (Chicago: Scott, Foresman Company,
1926), especially pp. 6-12. Or, see Chapter III of the present volume for an abstract
of McGregor’s study.

situation only in the fact that two examinations were given
instead of the usual one. 

These two examinations were equally carefully construct- 
ed, and so far as the teacher could judge, they were equally 
valid and reliable. They were made at an interval of several 
days, and, of course, there was no way of making them 

TABLE 73

Scores and Ranks of a Class of 20 Pupils on Two Objective Tests
Given on Two Different Days

           Score on   Score on   Differ-   Rank on   Rank on   Diff. in
Pupil       Test I    Test II     ence      Test I   Test II   Rankings

   1          38         56        18         20        18        2
   2          60         76        16         15        14        1
   3          76         88        12          9         8        1
   4          89         99        10          2         2        0
   5          67         78        11         12        12        0
   6          82         87         5          6         9        3
   7          49         57         8         17        17        0
   8          64         68         4         13        15        2
   9          58         77        19         16        13        3
  10          85         89         4          4         7        3
  11          86         94         8          3         4        1
  12          83         97        14          5         3        2
  13          71         79         8         11        11        0
  14          46         54         8         18        19        1
  15          72         92        20         10         5        5
  16          61         64         3         14        16        2
  17          41         46         5         19        20        1
  18          91        100         9          1         1        0
  19          81         91        10          7         6        1
  20          80         86         6          8        10        2

Averages     69.0       78.9       9.9                            1.5

exactly equal in difficulty. The matter of equality of diffi- 
culty was again a question of the opinion of the teacher; she 
made them as nearly equal as she could judge. 

The two examinations covered the same general scope of 
classwork and appeared to be interchangeable in use; i. e., 
so far as the teacher could tell, one was just as good as the
other, and she was quite willing that either should have been
used alone as a measure of her pupils. This point has an
important bearing on our discussion. 

These two examinations were very reliable for tests of 100 
items. The reliability coefficient was found to be 0.95.¹
The conditions of this experiment would be analogous to 
the situation of a test having been given but the test papers 
accidentally destroyed. A re-test was made up to be as 
nearly equivalent to the first as possible. The question 
is: What difference in the scores of individual pupils might
be expected? 
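
As a rough check on this figure, the following sketch computes the product-moment correlation between the two columns of scores in Table 73. It assumes that the reliability coefficient reported here is simply the correlation between the two forms; the function name pearson_r is ours and not the author's.

```python
from math import sqrt

# Scores of pupils 1 to 20 on Test I and Test II, copied from Table 73.
test_1 = [38, 60, 76, 89, 67, 82, 49, 64, 58, 85,
          86, 83, 71, 46, 72, 61, 41, 91, 81, 80]
test_2 = [56, 76, 88, 99, 78, 87, 57, 68, 77, 89,
          94, 97, 79, 54, 92, 64, 46, 100, 91, 86]


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(test_1, test_2)
print(round(r, 2))  # 0.95
```

On these twenty pairs of scores the coefficient rounds to 0.95, in agreement with the value quoted.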

With these facts in mind, we can turn to the results shown 
in Table 73. 

1. We note that Test I averaged about ten points harder 
(lower) than Test II, a common finding under such cir- 
cumstances. 

2. No pupil obtained exactly the same score on both 
tests. The least difference found was three points, the 
largest disagreement was twenty points, and the average 
disagreement was about ten points (9.9). 

3. When the actual scores were changed into rank-orders, 
five pupils received exactly the same rank on both tests. 
The smallest difference in ranks was zero, the largest proved 
to be five, and the average was 1.5. The differences in ranks 
were much smaller than the differences in the actual scores. 

4. The tests used in this experiment were unusually 
reliable (0.95) in the light of actual practice. There can be 
no exaggeration of the facts on this score. The one possible 
element in the situation which may not be typical is the 
fact that one test averaged ten points easier than the other. 
Extensive investigations have shown that average differences
of from eight to twelve points are roughly the expectancy
when dual examinations are given.²


¹See Chapter XV for the meaning and calculation of reliability coefficients.
²See Chapter III.

5. When two independent (but supposedly equivalent)
examinations are given, one or all of three things may 
happen: 

(a) The tests may prove to be of unequal difficulties (i. e.,
yield different average scores, as in the present case). 

(b) The tests may show unequal variabilities (i. e., the
range or spread of score may be quite different). 

(c) There may be little correlation or correspondence on 
the scores of individual pupils (i. e., the scores might show 
very different rank-orders on the two tests). 

6. In the present case, (a) was found to be true, but 
inspection will show that (b) and (c) are not serious disturbing 
factors. All in all, therefore, we have a situation which can 
hardly be held to be out of the expectancy. 
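
The figures cited in points 1 to 3 can be recomputed directly from Table 73. The sketch below is one way of doing so; the helper ranks is our own and assumes, as is true of these data, that no two pupils tie on either test.

```python
# Scores of pupils 1 to 20 on Test I and Test II, copied from Table 73.
test_1 = [38, 60, 76, 89, 67, 82, 49, 64, 58, 85,
          86, 83, 71, 46, 72, 61, 41, 91, 81, 80]
test_2 = [56, 76, 88, 99, 78, 87, 57, 68, 77, 89,
          94, 97, 79, 54, 92, 64, 46, 100, 91, 86]


def ranks(scores):
    """Rank 1 goes to the highest score; valid only when no scores are tied."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(score) + 1 for score in scores]


n = len(test_1)
score_diffs = [abs(a - b) for a, b in zip(test_1, test_2)]
rank_diffs = [abs(a - b) for a, b in zip(ranks(test_1), ranks(test_2))]

print("Averages:", sum(test_1) / n, sum(test_2) / n)
print("Score differences (least, largest, average):",
      min(score_diffs), max(score_diffs), sum(score_diffs) / n)
print("Rank differences (least, largest, average):",
      min(rank_diffs), max(rank_diffs), sum(rank_diffs) / n)
print("Pupils with the same rank on both tests:", rank_diffs.count(0))
```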

Ranks vs. per cents. The test under study was purposely 
made to yield a maximum of 100 points. This was done in
order (1) to make it comparable with the ordinary examina- 
tion which is usually graded upon a basis of 100 (or 100%), 
and (2) to raise an issue about the 100-point (100 per cent) 
grading plan. 

The previous discussion in this chapter has already 
pointed out that there are two general types of grading 
practices employed widely at present: grading from 0 to 100 
(often stated as per cents) and grading by some ranking sys- 
tem based upon the normal curve, e. g., stated percentages 
of A, B, C, etc., or some equivalent designation. 

It is also quite evident from the previous content of this 
chapter that the merits of these two rival methods have 
been the subject of voluminous controversy. The present 
experiment contributes data pertinent to the issue. 

Numerical (obtained) test scores or examination marks 
are used in two ways: (a) as they stand (63, 78, 91, etc., or 
sometimes 63%, 78%, 91%, etc.), or (b) translated into
letter grades by some such plan as the one which is outlined
below.

Plan A

A               95 or above
B               85 to 94
C               75 to 84
D               70 to 74
E (Failure)     Below 70

The exact numerical points of division naturally differ 
somewhat from teacher to teacher or from school to school. 
For our purposes, it matters little where the breaks come. 

The second (and newer) method of assigning grades is 
based upon the idea of the normal curve (probability curve). 
It is essentially a counting-in process. Stated percentages of 
each letter-grade are given, regardless of the exact magni- 
tudes of the original obtained numerical grades. This 
method is fundamentally a ranking method. A common 
scheme (and the one adopted for comparison here) is: 

Plan B

A               Highest 5%
B               Next 20%
C               Middle 50%
D               Next 20%
E (Failure)     Lowest 5%

For convenience we have termed these two plans Plan A
and Plan B. Tables 74 and 75 show the two plans applied 
to the data of Table 73. 

Discussion of Tables 74 and 75. It will be necessary for 
us to keep our bearings in discussing Tables 74 and 75. In 
the first case we must exclude the question of whether the 
counting-in (normal curve) idea is strictly applicable in 
comparing different classes of pupils. This point has al- 
ready been discussed in a previous section. We are con- 
cerned here with successive (and supposedly equally valid) 
examinations of the same pupils.¹

¹The problem here also is not equivalent to whether raw scores or ranks show the same
correlation, or whether rank correlations are better or worse than correlations of actual
scores. The two correlations are substantially the same. 
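
The rank side of this comparison can be sketched with the familiar rank-difference formula, $\rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)}$, where d is a pupil's difference in rank on the two tests. The formula is not given in this chapter and is brought in here only as an illustration; it applies in this simple form because Table 73 contains no tied scores.

```python
# Scores of pupils 1 to 20 on Test I and Test II, copied from Table 73.
test_1 = [38, 60, 76, 89, 67, 82, 49, 64, 58, 85,
          86, 83, 71, 46, 72, 61, 41, 91, 81, 80]
test_2 = [56, 76, 88, 99, 78, 87, 57, 68, 77, 89,
          94, 97, 79, 54, 92, 64, 46, 100, 91, 86]


def ranks(scores):
    """Rank 1 goes to the highest score; valid only when no scores are tied."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(score) + 1 for score in scores]


n = len(test_1)
# Sum of squared rank differences, pupil by pupil.
d_squared = sum((a - b) ** 2 for a, b in zip(ranks(test_1), ranks(test_2)))
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))

print(d_squared, round(rho, 2))  # 78 0.94
```

The rank correlation comes to about 0.94, against the coefficient of 0.95 reported for the scores themselves.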
TABLE 74*

Plan A Applied to the Two Sets of Examination Marks of Table 73

Letter   Basis of
Grade    Assignment     Test I           Test II

A        95 or above    None             No. 18 (100)†
                                         No. 4 (99)†
                                         No. 12 (97)†

B        85 to 94       No. 18 (91)†     No. 11 (94)
                        No. 4 (89)†      No. 15 (92)†
                        No. 11 (86)      No. 19 (91)†
                        No. 10 (85)      No. 10 (89)
                                         No. 3 (88)†
                                         No. 6 (87)†
                                         No. 20 (86)†

C        75 to 84       No. 12 (83)†     No. 13 (79)†
                        No. 6 (82)†      No. 5 (78)†
                        No. 19 (81)†     No. 9 (77)†
                        No. 20 (80)†     No. 2 (76)†
                        No. 3 (76)†

D        70 to 74       No. 15 (72)†     None
                        No. 13 (71)†

E        Below 70       No. 5 (67)†      No. 8 (68)
                        No. 8 (64)       No. 16 (64)
                        No. 16 (61)      No. 7 (57)
                        No. 2 (60)†      No. 1 (56)
                        No. 9 (58)†      No. 14 (54)
                        No. 7 (49)       No. 17 (46)
                        No. 14 (46)
                        No. 17 (41)
                        No. 1 (38)


*Individual pupils are designated No. 1, No. 2, etc., to correspond with Table 73. The
numbers in parentheses are the actual examination marks. The bold-face entries (marked †
here) represent disagreements between letter marks by Tests I and II.


The question is that of illustrating whether actual numeri- 
cal scores on the basis of 100 (or 100%) are as accurate as 
the method of counting-in (the normal curve idea) when ex- 
aminations, if repeated, show (a) differences in average diffi- 
culty, and (b) differences in variability, in addition to (c) those 
differences due to unreliability proper. Factor (c) has been
controlled to a degree which makes the present data far more
reliable than would be true in the average run of events.

TABLE 75*

Plan B Applied to the Two Sets of Examination Marks of Table 73

Letter   Basis of
Grade    Assignment         Test I           Test II

A        Highest 5%         No. 18 (91)      No. 18 (100)

B        Next highest 20%   No. 4 (89)       No. 4 (99)
                            No. 11 (86)      No. 12 (97)
                            No. 10 (85)†     No. 11 (94)
                            No. 12 (83)      No. 15 (92)†

C        Middle 50%         No. 6 (82)       No. 19 (91)
                            No. 19 (81)      No. 10 (89)†
                            No. 20 (80)      No. 3 (88)
                            No. 3 (76)       No. 6 (87)
                            No. 15 (72)†     No. 20 (86)
                            No. 13 (71)      No. 13 (79)
                            No. 5 (67)       No. 5 (78)
                            No. 8 (64)       No. 9 (77)†
                            No. 16 (61)†     No. 2 (76)
                            No. 2 (60)       No. 8 (68)

D        Next 20%           No. 9 (58)†      No. 16 (64)†
                            No. 7 (49)       No. 7 (57)
                            No. 14 (46)      No. 1 (56)†
                            No. 17 (41)†     No. 14 (54)

E        Lowest 5%          No. 1 (38)†      No. 17 (46)†


*Individual pupils are designated as No. 1, No. 2, etc., to correspond with Table 73.
The numbers in parentheses are the actual examination marks. The bold-face entries
(marked † here) represent disagreements between letter marks by Tests I and II.


Moreover, the differences termed (a) and (b) are not extreme, 
as experiments have shown.

Plan A shows twelve disagreements in final letter grades
for the two tests. Plan B shows but six (the cases in bold- 
face type). Plan B, of necessity, shows no disagreements in 
the numbers of each letter-grade given. Plan A shows many 
such variations in letter-grades from one test to the next: 


Distribution of Letter-grades by Plan A

             A's   B's   C's   D's   E's (Failures)
Test I        0     4     5     2     9
Test II       3     7     4     0     6

Let us assume that we have no other basis for assigning
marks to these twenty pupils than the results of Tests I and
II. Which set is better? There is no answer. They corre- 
late well, and we must assume that they are approximately 
equally good. Which plan for marking is the better? The 
answer seems to favor Plan B, both on the basis of the 
number of disagreements and on the basis of uniformity of 
different letter-grades given. 
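
The counts of disagreements may be verified by applying both plans mechanically to the scores of Table 73. The sketch below is an illustration only: the function names are ours, and the counting-in routine assumes a class size (here twenty) for which the stated percentages yield whole numbers of pupils.

```python
from collections import Counter

# Scores of pupils 1 to 20 on Test I and Test II, copied from Table 73.
test_1 = [38, 60, 76, 89, 67, 82, 49, 64, 58, 85,
          86, 83, 71, 46, 72, 61, 41, 91, 81, 80]
test_2 = [56, 76, 88, 99, 78, 87, 57, 68, 77, 89,
          94, 97, 79, 54, 92, 64, 46, 100, 91, 86]


def plan_a(score):
    """Plan A: fixed numerical limits for each letter grade."""
    if score >= 95:
        return "A"
    if score >= 85:
        return "B"
    if score >= 75:
        return "C"
    if score >= 70:
        return "D"
    return "E"


def plan_b(scores):
    """Plan B: count in stated percentages of each letter, from the top rank down."""
    shares = [("A", 0.05), ("B", 0.20), ("C", 0.50), ("D", 0.20), ("E", 0.05)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    grades = [""] * len(scores)
    start = 0
    for letter, share in shares:
        count = round(share * len(scores))  # whole numbers when the class is 20
        for i in order[start:start + count]:
            grades[i] = letter
        start += count
    return grades


a1, a2 = [plan_a(s) for s in test_1], [plan_a(s) for s in test_2]
b1, b2 = plan_b(test_1), plan_b(test_2)

disagreements_a = sum(g1 != g2 for g1, g2 in zip(a1, a2))
disagreements_b = sum(g1 != g2 for g1, g2 in zip(b1, b2))

print("Plan A disagreements:", disagreements_a)  # 12
print("Plan B disagreements:", disagreements_b)  # 6
print("Plan A letters, Test I: ", dict(Counter(a1)))
print("Plan A letters, Test II:", dict(Counter(a2)))
```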

If the results of this experiment are typical, differences of
difficulty (and of variability, if present) from one examina- 
tion to another very largely destroy any validity which 
fixed or absolute standards seem to have in theory. (By fixed 
standards we mean stated passing marks, stated numerical 
limits of A’s, B’s, etc., and other direct uses of obtained 
numerical scores when these are turned into the 100%-scale 
of markings.) Note, however, that these objections to the 
actual numerical scores are not based upon their inherent 
value, reliability, or validity, but upon the attempt to 
transmute such actual numerical scores into per cents (or in 
general the familiar scale of marks beginning at 0 and ex- 
tending to 100). 

A perfectly valid and reliable test would rank every pupil in a 
group in exact order of merit. Moreover, if the test were per- 
fectly valid and reliable, a pupil obtaining 90 would probably 
be more superior to one obtaining 85 than the latter would be 
to one obtaining 84, but these scores would not represent per 
cents for many reasons, among which are: 

(1) The zero point of such a series of scores is unknown. 

(2) Such a scale ends at 100 only if the maximum score 
happens to be 100. Had there been 113 items, 113 would 
have been a perfect score. Had there been 77 items, no 
pupil could have earned “100.” 

(3) Had a harder test been given, the scores would have 
averaged much lower. To answer all the questions on an
easy test is to accomplish less than to answer ninety out of 
a hundred on a very difficult one.

(4) The units of scores on tests and examinations are
arbitrary. Where they begin and end depends upon the 
difficulty of the test. Moreover, the units change in value 
from one part of the test to the next; the first ten points are
usually easier to obtain than the last ten points. Test scores 
are not numbers in the arithmetic sense which permits 
addition, subtraction, multiplication, division, and other 
operations regardless of whether the numbers are near zero 
or are very large. 

(5) Unless much is known about the meaning of the 
numerical scores on a test (and seldom is this condition well
met in practice), it seems to be safer to regard the scores as 
rank-orders. 

(6) The more reliable the test, the more the resulting 
ranks may be taken at face value. 

(7) The 100%-scale should probably be abandoned in 
favor of methods more nearly approximate to the known 
facts about individual differences. This is especially true if 
objective tests are employed where the numerical scores are 
functions of such facts as the length of the test, the difficulty 
of the test, etc. 



Part IV 

STATISTICAL TREATMENT AND INTER- 
PRETATION OF OBJECTIVE TEST RESULTS 




CHAPTER XV 


STATISTICAL PROBLEMS 
RELATED TO MEASUREMENT¹

Introduction. A number of statistical measures like the 
standard deviation and the coefficient of correlation have 
been mentioned freely in this volume, particularly in Part 
III. Some of these have received casual and superficial 
definition in passing. This chapter brings together as seven 
Problems the basic statistical processes and concepts neces- 
sary to a reasonably secure mastery of the principles of 
test interpretation. 

Although the time has not yet come when the writer on 
educational measurement dares to take for granted that all 
teachers are thoroughly conversant with elementary statis- 
tical concepts, we have certainly reached the stage where no 
apologies are necessary for introducing such topics. 

The discussions of this chapter center about a very few 
elementary procedures in the application of statistical 
method to such problems as the summarizing of test scores, 
the critical evaluation of tests, the determination of reliabili- 
ty and the accuracy of individual test scores, and the 
interpretation of the significance of correlation coefficients 
as measures of relationship and prediction. It is to be hoped 
that the reader will consult the references mentioned in the 
footnotes and the General Bibliography. 

PROBLEM 1 

Summarizing a Series of Test Scores

All teachers are already familiar with the arithmetic mean 
or “average” and with its general uses and significance.

¹This chapter has been purposely reserved to the end of the volume so that it may be
omitted without much loss to the general reader. The student will probably find it very
much worth while to read through the discussion of the seven Problems in this chapter.

Most educators have also come to think easily in terms of
the median or mid-measure as well. The arithmetic mean 
(ordinarily called the average or simply the “mean”) and 
the median are the two most common measures of central 
tendency. The expression “central tendency” is worth 
knowing since its obvious meaning helps to define the 
concept of the average or the median. 

Ordinarily the teacher will need to summarize scores or 
marks upon a comparatively small number of pupils; most 
often fifty or fewer. In such a case, long methods of cal- 
culation will not be found very uneconomical. On the 
other hand, any teacher must find a great many averages in 
the course of a school year. To know a few short-cuts, if 
these are easy to learn, will save much useless figuring. 

The average (arithmetic mean) is ordinarily defined as: 
the sum of the scores (or other numbers) divided by the 
number of scores summed. The statistician usually represents
a score or other number by the letter X, and the
number of cases (scores) by the letter N. He also indicates
the operation of summing by the Greek letter sigma (Σ).
Using these letters, the formula for the arithmetic mean is

$$M = \frac{\sum X}{N}.$$

This formula is read: “The mean equals the sum
of the scores divided by the number of scores.” There
is an X for each pupil in the class, and N represents the 
number of pupils (in the case of tests, scores for a class). 
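
As a minimal sketch of the formula, the lines below find the mean of the twenty Test I scores of Table 73, which are used here simply because they are already before us. The variable names are ours.

```python
# Test I scores of pupils 1 to 20, copied from Table 73.
scores = [38, 60, 76, 89, 67, 82, 49, 64, 58, 85,
          86, 83, 71, 46, 72, 61, 41, 91, 81, 80]

N = len(scores)      # number of cases
sum_X = sum(scores)  # the sum of the scores, written with sigma in the formula
M = sum_X / N        # arithmetic mean

print(sum_X, N, M)  # 1380 20 69.0
```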

Three methods of calculating the arithmetic mean or 
average may be described briefly. These are: 

1. The long method for ungrouped measures. 

2. The short method for ungrouped measures. 

3. The short method for grouped measures. 

Table 76 gives the scores earned by a class of thirty 
pupils on an objective test. This table will serve to illustrate
the “long” and “short” methods for ungrouped
measures (scores).

The long method for ungrouped measures. Column
(b) of Table 76 shows the scores of the thirty pupils. The
computational steps are: 

1. Add the thirty measures. (Sum = 2334.) We can express
this step as ΣX = 2334.

2. Divide the sum of the measures by N (30). The arithmetic
mean (M) is therefore: 2334 ÷ 30 = 77.8.

The short method for ungrouped measures. Table 76
also shows the computations for this method (especially 
column c). Note that we have introduced a new symbol, 
x'. The meaning of this will become clear in the outline of 
the steps employed in the short method. 

1. Inspect the series of scores or numbers carefully. 
Select some value which appears (without computation) to be
a reasonable guess or estimate of the average. Call this 
value M'. In Table 76, the value 80 was taken as M' (known 
variously as the “guessed average,” the “assumed mean” 
or the “arbitrary origin”). 

2. Subtract the guessed mean (M') from each measure (X value),
algebraically, i.e., observing signs. (See column
c.) It is often helpful, as was done in Table 76, to keep
positive and negative values of these differences (x') in
separate columns.

3. Add the x' values, obtaining their algebraic sum (-66
in our example).

4. Substitute in the following formula and solve:

$$M = M' + \frac{\sum x'}{N} = 80 + \frac{-66}{30} = 80 - 2.2 = 77.8 \text{ (arithmetic mean).}$$

It should be noted that the final value of the mean is the 
same by both long and short methods. This will always be 
found to be the case, as the short method automatically corrects
the guessed mean (M'). It makes no difference whether
the guessed mean (M') is taken near the true mean (M) or
whether we make a “bad” guess; the correction is made
automatically. However, a poor guess does increase the
size of the numbers to be handled and increases slightly the
danger of a computational error. Since we are interested in
saving labor, the attempt should be made to secure a fairly
reasonable estimate at the outset, but spending much time
on the selection of M' will defeat the sole purpose of the short 
method, viz., economy of time and labor. 
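
The automatic correction is easily demonstrated. The sketch below applies the short method to the Test I scores of Table 73 with a reasonable guess (70) and with a deliberately poor one (40); both guesses are our own choices for illustration, and the function name is hypothetical.

```python
# Test I scores of pupils 1 to 20, copied from Table 73.
scores = [38, 60, 76, 89, 67, 82, 49, 64, 58, 85,
          86, 83, 71, 46, 72, 61, 41, 91, 81, 80]


def short_method_mean(scores, guessed_mean):
    """Mean by the short method: M = M' + (sum of x') / N, where x' = X - M'."""
    deviations = [x - guessed_mean for x in scores]  # the x' values, with signs
    return guessed_mean + sum(deviations) / len(scores)


good_guess = short_method_mean(scores, 70)  # deviations sum to -20
poor_guess = short_method_mean(scores, 40)  # deviations sum to +580

print(good_guess, poor_guess)  # 69.0 69.0
```

The poor guess gives the same mean, but only after a sum of 580 has been handled in place of a sum of 20, which is the practical reason for guessing sensibly.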

The short method does not always represent time saved. 
It usually is advantageous if twenty or more numbers must 
be handled. The principal advantage of the short method 
is that it gives us small numbers to deal with and hence 
reduces mental effort with consequent probability of fewer 
computational errors and re-checkings of calculations. 

The short method for grouped measures. If a large 
number of measures (say, 50 to 100, or more) are to be
averaged, statisticians often simplify their computations 
by what is called grouping. In other language they form a 
frequency distribution of the measures. Grouping is essen- 
tially the throwing together, as classes, measures which fall 
close together. We shall illustrate the method by the use 
of the same thirty scores of Table 76. The steps follow: 

1. Find the range of scores (the difference between the
highest and lowest measure). In this case the range is
99 - 40 = 59.

2. Divide the range (here 59) by some number which will 
give a quotient of from 15 to 20. Three would be the best 
number in this case. We will call three the range of the class- 
interval and designate it by the symbol i. This meeins that 
we will group the thirty scores to the nearest three. 

3. Form class-intervals by 3’s, proceeding from the large 

values to the small thus: 98 to 100 

95 to 97 
92 to 94 Etc. 




Taking the first mentioned class-interval (98 to 100) as 
an illustration, this procedure means that we will consider 
all scores of 98, 99, or 100 as falling in the same class. More 
specifically, we will consider scores of 98, 99, and 100 as all 
falling at the mid-point (99) of that class (98 to 100). 

We usually choose classes so that the range of the class-
interval (i) is 2, 3, 5, 10 (or some multiple of 5), when
possible, but the important thing is to be sure that the total
number of classes does not fall much below, say, fifteen.

4. Classify the measures falling into each class-interval
as shown in Table 77. Note that we have used column (b)
for tallying the frequencies (numbers) falling in each class-
interval, and that column (c) collects the tallies as numbers
for the sake of convenience in subsequent multiplications.

5. Guess or assume some average (M'). This was taken
to fall in the class-interval 74 to 76. In order to give some
particular number as the value of M', the mid-point (75) of
this class is taken. (This is equivalent to assuming that in
this class, as in all others, all measures falling within that
class lie exactly on the mid-point of the class.) Draw lines,
as shown, to indicate the class-interval in which the mean
was assumed to lie.

6. Calling the deviations of each class-interval from the
class-interval in which the mean was assumed to lie (74 to
76) x', write in the deviations (x') both ways from the
assumed mean, as shown in column (d). Give signs to these
values to show whether the deviations are above or below
the assumed mean. Note that the values of x' are in terms
of class-intervals, not actual scores. Thus the class 71-to-73
is taken at -1, meaning that it is one class below the
assumed mean. In reality, the measures falling in the class
71-to-73 are three scores below the class 74-to-76, in which
the mean was assumed to lie, since the grouping was by 3’s.
The formula used for finding the arithmetic mean for
grouped distributions automatically takes care of the fact
that the stated x' values are but one-third of their true size.
TABLE 77

The Scores of Table 76 Re-Classified as a Frequency Distribution
Using 3 as a Class-Interval

      (a)           (b)      (c)     (d)     (e)
 Class-Interval    Tally      f       x'     fx'

   98 to 100       /          1       8       8
   95 to  97       /          1       7       7
   92 to  94                          6
   89 to  91       /////      5       5      25
   86 to  88       ///        3       4      12
   83 to  85       ///        3       3       9
   80 to  82       //         2       2       4
   77 to  79       ///        3       1       3
 ----------------------------------------------
   74 to  76       /          1       0     (68)
 ----------------------------------------------
   71 to  73       /          1      -1      -1
   68 to  70       /////      5      -2     -10
   65 to  67       ///        3      -3      -9
   62 to  64                         -4
   59 to  61                         -5
   56 to  58                         -6
   53 to  55                         -7
   50 to  52       /          1      -8      -8
   47 to  49                         -9
   44 to  46                        -10
   41 to  43                        -11
   38 to  40       /          1     -12     -12

   Σ (Sums)                  30            (-40)

   Σfx' = 68 - 40 = 28

$$M = M' + i\,\frac{\Sigma fx'}{N} = 75 + 3\left(\frac{28}{30}\right) = 77.8$$



7. Multiply each x' by the corresponding f and record the
products in the fx' column (column e).

8. Add the fx' column.

9. Substitute in the formula:

$$M = M' + i\,\frac{\Sigma fx'}{N} = 75 + 3\left(\frac{28}{30}\right) = 75 + 2.8 = 77.8$$

It happened in this case that we obtained the same value 
for the mean as we did by the two preceding methods. This 
exact correspondence was a matter of chance. Ordinarily
there will be a slight disagreement between the means 
figured by grouped and ungrouped methods. The reader 
may test this by re-classifying the thirty scores with a 
grouping by 5’s. As a rule, if N is fairly large and the 
number of class-intervals does not fall much below 15, the 
grouped distribution will yield a mean very close to the 
ungrouped value. A little thought will show that grouping 
will inevitably distort the actual values somewhat but that 
this distortion tends to cancel out in the long run. In any 
event, examination marks and test scores are at best es- 
timates of the facts, and hence a slight distortion by group- 
ing has little significance in comparison with the economy of 
the method. As was noted before, a distribution with but 
thirty cases is not adequate to show the real economies of 
grouping. Had there been 300 pupils, the saving in time 
and energy would have been very evident. Like the two 
preceding methods (for ungrouped measures), it does not 
matter particularly which class-interval is assumed to in- 
clude the mean. A bad guess increases the computational 
labor, but the formula automatically corrects for the poor 
judgment. 
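
The grouped short method can be traced in a brief sketch using the frequencies of Table 77. The function and variable names are illustrative, and the mid-point rule assumes whole-number scores, as in the table.

```python
# Frequencies (f) read from Table 77, keyed by the lower limit of each
# class-interval; empty classes are simply omitted.
FREQUENCIES = {98: 1, 95: 1, 89: 5, 86: 3, 83: 3, 80: 2, 77: 3, 74: 1,
               71: 1, 68: 5, 65: 3, 50: 1, 38: 1}
INTERVAL = 3  # i, the range of the class-interval


def grouped_mean(frequencies, interval, guessed_lower_limit):
    """Mean of a grouped distribution: M = M' + i * (sum of f*x') / N."""
    n = sum(frequencies.values())
    # M' is the mid-point of the class assumed to contain the mean
    # (whole-number scores, so the class 74-to-76 has mid-point 75).
    guessed_mean = guessed_lower_limit + (interval - 1) / 2
    # x' counts class-intervals (not score points) above or below M'.
    sum_fx = sum(f * ((lower - guessed_lower_limit) // interval)
                 for lower, f in frequencies.items())
    return guessed_mean + interval * sum_fx / n


mean_from_74 = grouped_mean(FREQUENCIES, INTERVAL, 74)  # the guess of Table 77
mean_from_50 = grouped_mean(FREQUENCIES, INTERVAL, 50)  # a much poorer guess
```

Either choice of class-interval for the assumed mean gives 77.8, which illustrates the remark that the formula corrects a bad guess automatically.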

PROBLEM 2 

Determining the Reliability of a Test 

This volume has made repeated reference to correlation 
or the mathematical relationship between two sets of test 
scores or other values. In particular, the terms reliability 
coefficient and validity coefficient have been used frequently 
without presenting any exact definitions. Because of the 
very great importance of correlation methods in test con- 
struction a brief introduction to the statistics of the measure- 
ment of relationship is given here. A concrete illustration is 
chosen as an introduction to the calculation of the coefficient 
of correlation. 
As one assignment in a correspondence study course,¹ the
author outlines the following project:

1. Make up at least 20 or 25 broad essay-type or discussion questions
(any subject). If possible, have each question call for something more than
pure facts, i.e., call for some judgment, reasoning, originality, and organi-
zation of thought.

2. From the 20 or 25 questions select two sets of 10 questions each 
(20 in all). Make these two sets of questions as equal in difficulty as 
possible. The aim here is to prepare two examinations which are equally 
difficult, valid, and reliable so far as it is within your power to judge. 

3. Call one set of questions “Examination A” and the other “Examina-
tion B.” Give “A” one day and “B” the next day, both to the same
pupils.

4. Erase the pupils' names before grading and replace the names by
numbers. Then shuffle the papers and grade both sets upon the scale of
100 points. Etc.

Below are the results from this experiment on a class of 
twenty-two pupils. 


Pupil            1   2   3   4   5   6   7   8   9  10  11  12  13
Examination A   90  88  72  53  62  64  86  57  65  87  87  91  83
Examination B   79  69  75  47  62  67  72  40  47  62  90  78  63

Pupil           14  15  16  17  18  19  20  21  22        Mean
Examination A   61  78  67  91  91  94  75  90  67        77.2
Examination B   32  59  71  87  53  78  51  66  58        63.9



Table 78 shows the tabulation of the above scores together
with the steps in the calculation of the coefficient of corre-
lation. To distinguish the scores or marks on the two ex-
aminations, we have designated the A scores as X and the B
scores as Y. Then x' and y' represent deviations of the X
and Y from their respective guessed means ($M'_x$ and $M'_y$).
The values given in the row of sums at the bottom of
Table 78 furnish the necessary facts for solving for r, the
coefficient of correlation. The solution follows Table 78.

¹Education 312AB, University of California, “New-Type or Objective Examinations,
Assignments 11-12.” The actual results quoted were contributed by Mrs. Violet G. Prather, 
Wasco, California. 
regarded as two independent measures or samplings of the
same thing, we call such a correlation coefficient a reliability
coefficient. The particular formula employed in this solution
is one form of what is called the Pearson product moment
formula for correlation.

The reader will be interested to note that the calculations
necessary for finding r also provide the needed data for the
calculation of the arithmetic means as well. To find $M_x$
and $M_y$, we need only substitute the values for the Σx' (49)
and the Σy' (-24) in the formula previously given for the
means for ungrouped scores by the “short method.” Thus:

$$M_x = M'_x + \frac{\Sigma x'}{N} = 75 + \frac{49}{22} = 77.2$$

$$M_y = M'_y + \frac{\Sigma y'}{N} = 65 + \frac{-24}{22} = 63.9$$
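
The whole calculation for Examinations A and B can be sketched as follows. The scores and the guessed means (75 and 65) are those given above. The form of the product-moment formula used here, with each sum corrected for the guessed mean, is the standard one and is assumed to match the form the author employs. The function name is illustrative.

```python
from math import sqrt

# Marks of the twenty-two pupils on Examinations A (X) and B (Y).
EXAM_A = [90, 88, 72, 53, 62, 64, 86, 57, 65, 87, 87, 91, 83,
          61, 78, 67, 91, 91, 94, 75, 90, 67]
EXAM_B = [79, 69, 75, 47, 62, 67, 72, 40, 47, 62, 90, 78, 63,
          32, 59, 71, 87, 53, 78, 51, 66, 58]


def correlate_from_guessed_means(xs, ys, guess_x, guess_y):
    """Return (r, Mx, My) using deviations x', y' from guessed means."""
    n = len(xs)
    dx = [x - guess_x for x in xs]  # x'
    dy = [y - guess_y for y in ys]  # y'
    cx = sum(dx) / n                # correction for the guessed mean of X
    cy = sum(dy) / n                # correction for the guessed mean of Y
    numerator = sum(a * b for a, b in zip(dx, dy)) / n - cx * cy
    sigma_x = sqrt(sum(a * a for a in dx) / n - cx ** 2)
    sigma_y = sqrt(sum(b * b for b in dy) / n - cy ** 2)
    return numerator / (sigma_x * sigma_y), guess_x + cx, guess_y + cy


r, mean_x, mean_y = correlate_from_guessed_means(EXAM_A, EXAM_B, 75, 65)
```

Rounded as in the text, this gives means of 77.2 and 63.9 and a coefficient of about .67. Changing the guessed means changes the intermediate sums but not the results.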

Returning to the reliability coefficient, it should be 
pointed out that the exact procedure followed in our illus- 
tration applies strictly only to the case where two duplicate 
or equivalent examinations have been given to the same 
pupils. There are in all three common methods of finding 
reliability coefficients: 

1. By correlation of the scores from duplicate or equiva-
lent examinations administered to the same pupils (as in
the foregoing discussion). This is ordinarily the most ac-
curate and defensible method.

2. By splitting the results from a single examination into
chance halves, correlating the half-scores, and “stepping up”
the resulting coefficient of correlation by means of the
Spearman-Brown prophecy formula (to be described later).

3. By repeating the same test or examination after an
interval and correlating the results. This is often called the
“re-testing coefficient of reliability.” This method should
never be employed when the first or second methods are
possible. The procedure for this method is exactly like that
shown in Table 78.

Since teachers usually give a single examination over a
particular unit of school work, Method 2 (chance halves) is
probably of the most general utility. For this reason we
shall illustrate the method by a second actual example.¹

In order to keep the illustration close to a real school 
situation, an actual examination from a seventh-grade class 
in physiology and hygiene has been selected. The original 
scoring of the teacher has been used in all the computations. 
The examination included these five questions; 

1. What are the two most important aims of physiology? 

2. State five rules of health that upper-grade children should know
and practice.

3. Name the two classes of muscles and give examples of each kind. 

4. Give two uses of the bones. 

5. Name three digestive juices and tell what each does to the food. 

Thirty pupils wrote on these questions, and each question 
was graded on a basis of twenty points. In order to find out 
just how reliable the results of this examination were, the 
scores earned on the odd-numbered items (questions 1, 3, 
and 5) were added separately from those of the even- 
numbered items (questions 2 and 4). This is known as 
“breaking the examination into chance halves.” Since but
five questions were given, the “halves” cannot be made to 
contain the same number of questions, it being necessary to 
place three questions in one of the chance “halves” and but 
two questions in the other. This will not make any very 
important difference in the correlation. Table 79 gives the 
data as tabulated from the thirty papers. 

The next step in the solution of our problem is that of 
obtaining the correlation of the chance halves of the ex- 
amination. The method of carrying out the actual computa- 
tions is shown by the solution in Table 79. The odds will 
be designated by X and the evens by Y, and these scores 
have been tabulated in the columns so designated. 

¹Taken, with changes, from The Improvement of the Written Examination, pp. 132ff.
TABLE 79

Total Scores and Scores on Chance Halves of an Examination in
Physiology and Hygiene for a Seventh-Grade Class of Thirty
Pupils

          Scores by Questions        Score     Score     Total
Pupil     I    II   III   IV    V    on Odds   on Evens  Score

   1     10    16    10    5    5      25        21        46
   2     10    20    20   10   15      45        30        75
   3     10    20    20   10   10      40        30        70
   4     10    20    10   15   10      30        35        65
   5     10    20    10   10   10      30        30        60
   6     20    20    20   10   17      57        30        87
   7     10    20    20   20   17      47        40        87
   8     20    20    20    8   15      55        28        83
   9     18    20    20   15   20      58        35        93
  10     10    20    20   10   10      40        30        70
  11     20    16    10   15   17      47        31        78
  12     20    20    10    0    0      30        20        50
  13     20    20    20    3   13      53        23        76
  14     18    18    20   10   10      48        28        76
  15     10    16    20   10   15      45        26        71
  16     12    10    10   10   17      39        20        59
  17     10    20    20   14   15      45        34        79
  18     20    20    20   18   20      60        38        98
  19     10    20    10    7    8      28        27        55
  20     15    20    10   20   20      45        40        85
  21     15    20    20   13    5      40        33        73
  22     18    20    20   18   18      56        38        94
  23     10    20    15   10   10      35        30        65
  24     20    18    15   15    8      43        33        76
  25     10    16    20   10   15      45        26        71
  26     20    20    20   20   17      57        40        97
  27     20    15    10    8   14      44        23        67
  28     18    10    20   15    7      45        25        70
  29     20    20    10   20   10      40        40        80
  30     18    20    12   18   10      40        38        78



The procedure from here on is identical with that of
Table 78 where the scores from two different examinations
were correlated. $M'_x$ was taken as 45 and $M'_y$ as 30.
(See Table 80.)

The solution for r is given in detail at the top of the next
page.

The arithmetic means are:

$$M_x = M'_x + \frac{\Sigma x'}{N} = 45 + \frac{-38}{30} = 45 - 1.3 = 43.7$$

$$M_y = M'_y + \frac{\Sigma y'}{N} = 30 + \frac{22}{30} = 30 + .7 = 30.7$$


Table 80 shows the data for the calculation of the re- 
liability coefficient by the method of chance halves. 

The reliability coefficient for the halves (odds vs. evens)
was found to be 0.37. This is to be interpreted as the re-
liability of either half of the examination. What we need
is the reliability of the whole examination, i.e., for the five
questions.
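
As a check on the value 0.37, the half-scores can be correlated directly from their true means (the "long" way). This is a sketch rather than the author's worked solution. The odds and evens scores are those of Table 79, and the function name is illustrative.

```python
from math import sqrt

# Scores on the odd-numbered (X) and even-numbered (Y) questions, Table 79.
ODDS = [25, 45, 40, 30, 30, 57, 47, 55, 58, 40, 47, 30, 53, 48, 45,
        39, 45, 60, 28, 45, 40, 56, 35, 43, 45, 57, 44, 45, 40, 40]
EVENS = [21, 30, 30, 35, 30, 30, 40, 28, 35, 30, 31, 20, 23, 28, 26,
         20, 34, 38, 27, 40, 33, 38, 30, 33, 26, 40, 23, 25, 40, 38]


def pearson_r(xs, ys):
    """Product-moment correlation using deviations from the true means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


r_halves = pearson_r(ODDS, EVENS)  # reliability of either half
```

The result rounds to .37, in agreement with the short-method solution based on guessed means of 45 and 30.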

To obtain $r_{nn}$ (meaning the correlation which would be
expected between the whole examination actually given and
a second but hypothetical examination of the same length
which might be made up of five similar questions) from
$r_{11}$ (meaning the correlation of chance halves which was
actually computed), we need only substitute the value 2 for
n, and 0.37 for r in the following formula:

$$r_{nn} = \frac{nr}{1 + (n - 1)r}$$

This formula is variously known as “Brown’s formula” and
as the “Spearman prophecy formula,” the latter name being,
perhaps, preferable. The solution for our problem is:

$$r_{nn} = \frac{2(0.37)}{1 + (2 - 1)(0.37)} = \frac{0.74}{1.37} = 0.54$$
TABLE 80

Calculation of the Reliability Coefficient by the
Method of Chance Halves

    X        Y
 (Odds)   (Evens)     x'      y'      x'²     y'²     x'y'

   25       21       -20      -9      400      81      180
   45       30         0       0        0       0        0
   40       30        -5       0       25       0        0
   30       35       -15       5      225      25      -75
   30       30       -15       0      225       0        0
   57       30        12       0      144       0        0
   47       40         2      10        4     100       20
   55       28        10      -2      100       4      -20
   58       35        13       5      169      25       65
   40       30        -5       0       25       0        0
   47       31         2       1        4       1        2
   30       20       -15     -10      225     100      150
   53       23         8      -7       64      49      -56
   48       28         3      -2        9       4       -6
   45       26         0      -4        0      16        0
   39       20        -6     -10       36     100       60
   45       34         0       4        0      16        0
   60       38        15       8      225      64      120
   28       27       -17      -3      289       9       51
   45       40         0      10        0     100        0
   40       33        -5       3       25       9      -15
   56       38        11       8      121      64       88
   35       30       -10       0      100       0        0
   43       33        -2       3        4       9       -6
   45       26         0      -4        0      16        0
   57       40        12      10      144     100      120
   44       23        -1      -7        1      49        7
   45       25         0      -5        0      25        0
   40       40        -5      10       25     100      -50
   40       38        -5       8       25      64      -40

 Σ (Sums)            -38     +22     2614    1130     +595

 Guessed
 Mean  45   30


The value for n is taken as 2, since the whole examination
is two times the length of the half examinations. It is
obvious that such an examination is not very reliable; in
fact, it should be considered quite inadequate for purposes of
measurement. Pupils tested with this examination would
in many cases earn very different marks if the examination
were repeated on another day with five questions as similar
in type to these as possible, but not identical in the knowledge
called for.

PROBLEM 3 

Uses of the Spearman-Brown Prophecy Formula

We have already seen one use of the Spearman-Brown
formula in connection with the physiology examination
whose reliability for chance halves was found to be 0.37.
“Stepping up” the 0.37 by the formula gave 0.54 as the
estimated reliability of the entire examination of five ques-
tions. Many similar uses might be cited. Before doing this
it should be noted that Table 81 is very useful in obtaining
directly approximately accurate results for many values of
r and n. If more exact results are needed, interpolation is
necessary.¹

We also found the examination of Table 78 to be .67 in
reliability. We might wish to know the reliability of the sum
or average of Examinations A and B. Examinations A and
B together are twice as long as either one alone; n is therefore
2. By direct substitution in the Spearman-Brown formula, setting
r equal to .67 and n equal to 2, we obtain .80 as our estimate.
Exactly the same value is obtained by interpolation between
.60 and .70 in Table 81. The reader should note that the
reliability of the sum and of the average of two examinations
is always the same.
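
Both estimates can be reproduced with a short sketch of the prophecy formula. The function name is illustrative. The formula is the usual Spearman-Brown expression, with r the reliability of the shorter test and n the number of times it is lengthened.

```python
def spearman_brown(r, n):
    """Estimated reliability of a test n times as long as one of reliability r."""
    return n * r / (1 + (n - 1) * r)


whole_physiology = spearman_brown(0.37, 2)  # chance halves stepped up to the whole
sum_of_a_and_b = spearman_brown(0.67, 2)    # Examinations A and B combined
```

These round to .54 and .80, the values quoted above. With n equal to 1 the formula returns r unchanged.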

¹Interpolation may be illustrated as follows: Suppose the reliability of the chance
halves of an examination is 0.37 (as found above for the physiology examination). The
value 0.37 does not occur in Table 81. We do find .30 and .40. Since n equals 2 when
halves are “stepped up” to estimate the whole, we can interpolate as follows:

      r       r_nn (when n = 2)
     .30          .46
     .40          .57
                  .11 (difference)

Since .37 is 7/10 of the distance from .30 to .40, the value of r_nn must be (approximately)
7/10 of the distance between .46 and .57, or 7/10 of .11, or about .08. Adding .08 to .46 we
have .54 as the estimated value of r_nn when r_11 is .37. This agrees with the value previously
found by actual substitution in the Spearman-Brown formula.
TABLE 81


Table for Obtaining Directly Values of r_nn for the Spearman-
Brown Prophecy Formula for Various Values of r and n



Another use of the Spearman-Brown formula may be
illustrated. A teacher gave a true-false test of 80 items.
The reliability of odds vs. evens proved to be .60. From
Table 81 it is evident that the reliability of the whole test
was about .75. Since the reliability of the halves was .60,
we may say that forty items (half the test) showed a re-
liability of .60. This teacher had hoped to obtain a relia-
bility of .90 for her examination. The problem is to find
how many items (of the same kind as she used) will be
needed to yield a reliability of .90. Starting with r = .60,
and reading along the row toward the right, we find .90 in
the column headed 6. This teacher will therefore need about
6 × 40 items or 240 items to obtain the desired .90. This,
of course, is an estimate. To be safe she should plan on at
least 250 items.

Standard-test makers find the estimates yielded by the
Spearman-Brown formula almost indispensable. The teacher
who takes the trouble to calculate reliabilities on her more
important tests will also find Table 81 of considerable value,
at least for purposes of reference in her thinking.

Table 81 also shows very clearly that if an examination
requiring a period (30 to 60 minutes) of class time yields a
reliability of less than .50, it is almost hopeless to attempt
to secure a reliable test by lengthening such an examination.
It would have to be nine times as long to yield .90 as the
reliability (if r = .50). An examination of the degree of
reliability represented by .30 or .40 could never be length-
ened within the time justified in actual teaching to become
very satisfactory.
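
Instead of reading along a row of Table 81, the prophecy formula can be solved for n. The sketch below assumes the usual Spearman-Brown relation. The rearranged form and the function name are not from the text.

```python
def lengthening_needed(r, target):
    """Factor n by which a test of reliability r must be lengthened
    to reach the target reliability (Spearman-Brown solved for n)."""
    return target * (1 - r) / (r * (1 - target))


n_true_false = lengthening_needed(0.60, 0.90)  # forty items with r = .60
items_needed = n_true_false * 40
n_for_half = lengthening_needed(0.50, 0.90)    # a test with r = .50
n_for_thirty = lengthening_needed(0.30, 0.90)  # a test with r = .30
```

The first two factors come out at 6 and 9, as read from the table. A test whose reliability is only .30 would need to be about 21 times as long, which shows why lengthening is called almost hopeless in such cases.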

PROBLEM 4 

The Measurement of Variability or Dispersion 

It is well known that the average gives but a partial picture
of the facts about a number of scores or marks. Two classes
might have the same average while showing marked differ-
ences in the spread or scatter of the pupils about that aver-
age. We say that the two classes show a different variability
or dispersion about the central tendency.

As a crude measure of variability, the range of the scores 
is sometimes used, thus: 

                   Lowest    Highest
Class   Average    Score     Score     Range

  A        81        34       112        78
  B        81        69        99        30

Class A shows a range more than two and one-half times 
as great as Class B. Although the range is often of consider- 
able value in expressing variability, it is open to the serious 
objection that it is too greatly influenced by occasional and 
chance factors which may give rise to one or more extreme 
deviates that are hardly typical. A single very bright or 
very dull pupil may exert an enormous influence on the
range. Consider the following facts:

                              Second
                   Lowest     Lowest    Highest
Class   Average    Score      Score     Score     Range

  C        76        13         51       102        89
  D        76        50         52       115        65


Class C has a much larger range (89) than Class D (65). 
If we examine these situations more closely, we see that the 
larger range of Class C is due to one extreme deviate (very 
low score), the pupil earning the score of 13. If we should 
take a number of similar classes of the same size, it is very 
unlikely that we should often find scores as low as 13. That 
pupil is atypical or unusual.

The usual measure of variability is the standard deviation.
It is a more stable measure than the range and has the im-
portant advantage of not being unduly influenced by oc-
casional and atypical, extreme variations. The standard
deviation is also referred to as sigma, and is abbreviated to
S. D., or the Greek letter σ. The standard deviation re-
quires considerable computation, although it is obtained
easily as a by-product from the calculation of the coefficient
of correlation. This will be clear from comparing the
formulas below with Tables 78 and 79 and the subsequent
calculations based upon these tables. We may give here
four variations of the formula for the standard deviation,
the choice among these four depending upon the needs in a
particular instance, i.e., whether the measures are ungrouped
or grouped and whether the deviations are taken from the
true mean or from a guessed mean.

(1) $\sigma = \sqrt{\dfrac{\Sigma x^{2}}{N}}$
    Ungrouped measures where the deviations (x) are taken
    from the true mean (M).

(2) $\sigma = \sqrt{\dfrac{\Sigma f x^{2}}{N}}$
    Grouped measures where the deviations (x) are taken
    from the true mean (M).

(3) $\sigma = \sqrt{\dfrac{\Sigma x'^{2}}{N} - \left(\dfrac{\Sigma x'}{N}\right)^{2}}$
    Ungrouped measures where the deviations (x') are taken
    from a guessed mean (M').

(4) $\sigma = i\sqrt{\dfrac{\Sigma f x'^{2}}{N} - \left(\dfrac{\Sigma f x'}{N}\right)^{2}}$
    Grouped measures where the deviations (x') are taken
    from a guessed mean (M').


Formula (1) will serve best to define the standard devia-
tion. We will now apply it to a very simple example, the
X values being the marks of seven pupils.


TABLE 82

Calculation of the Standard Deviation.



The very simple example of Table 82 shows the essential 
nature of the standard deviation. We took the square root 
of the sum of the squares of the deviations, after dividing 
this sum by N. Some scientists, particularly astronomers 
and physicists, call the standard deviation the “root mean 
square deviation,” meaning that the square root is taken of 
the mean (or average) of the deviations squared. A standard 
deviation based upon but seven cases is, of course, mean- 
ingless. 
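
A small sketch may make the "root mean square" idea concrete. The seven marks below are hypothetical and are not those of Table 82. The second function follows Formula (3) and is included only to show that a guessed mean leads to the same value.

```python
from math import sqrt

# Seven hypothetical marks, chosen so that the arithmetic is easy to follow.
MARKS = [60, 65, 70, 75, 80, 85, 90]


def sd_from_true_mean(scores):
    """Formula (1): square root of the mean of the squared deviations."""
    n = len(scores)
    mean = sum(scores) / n
    return sqrt(sum((s - mean) ** 2 for s in scores) / n)


def sd_from_guessed_mean(scores, guessed_mean):
    """Formula (3): deviations x' from a guessed mean, with the correction term."""
    n = len(scores)
    dev = [s - guessed_mean for s in scores]
    return sqrt(sum(d * d for d in dev) / n - (sum(dev) / n) ** 2)


sigma_true = sd_from_true_mean(MARKS)           # mean is 75, sigma is 10
sigma_guess = sd_from_guessed_mean(MARKS, 70)   # a guessed mean of 70
```

The squared deviations sum to 700, their mean is 100, and the square root is 10. Formula (3) with a guessed mean of 70 returns the same figure.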

If we turn to the formula for r given in connection with
Table 78, we see at once that the two radical terms in the
denominator are identical with Formula (3) above; except,
of course, for the use of y' instead of x' in the second term
in order to distinguish the two sets of scores. All that is
necessary in order to calculate $\sigma_x$ (the standard deviation of
the X scores) and $\sigma_y$ (the standard deviation of the Y
scores) is to extract the indicated square roots, thus:

$$\sigma_x = \sqrt{\frac{\Sigma x'^{2}}{N} - \left(\frac{\Sigma x'}{N}\right)^{2}} = 12.9 \qquad \sigma_y = \sqrt{\frac{\Sigma y'^{2}}{N} - \left(\frac{\Sigma y'}{N}\right)^{2}} = 14.5$$

Solving the two radical expressions in the denominator
of the r formula for the data following Table 79, we obtain
$\sigma_x = 9.2$ and $\sigma_y = 6.1$.
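
The two values for the chance halves follow directly from the row of sums in Table 80 (N = 30, Σx' = -38, Σx'² = 2614, Σy' = +22, Σy'² = 1130). A minimal sketch of Formula (3) applied to those sums, with an illustrative function name:

```python
from math import sqrt


def sd_from_sums(sum_dev, sum_dev_squared, n):
    """Formula (3): sigma from the sums of x' and x'^2 about a guessed mean."""
    return sqrt(sum_dev_squared / n - (sum_dev / n) ** 2)


sigma_odds = sd_from_sums(-38, 2614, 30)   # X, the odd-numbered questions
sigma_evens = sd_from_sums(22, 1130, 30)   # Y, the even-numbered questions
```

Rounded to one decimal place, the results are 9.2 and 6.1.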

Formula (3) is therefore basic to the calculation of the 
coefficient of correlation. When actual correlations are 
computed, the standard deviations require only a small 
additional calculation (the taking of the square roots of the 
two radical terms of the denominator). This gives the 
standard deviation as a by-product of the correlation process 
in much the same way that the means (averages) are yielded 
incidentally in finding r. The teacher need not avoid the
use of the standard deviation because of extra labor, since
the standard deviation is ordinarily not needed unless a
critical study of reliability is undertaken, in which case
$r_{xy}$, $M_x$, $M_y$, $\sigma_x$, and $\sigma_y$ are computed as one general
calculation.

We should now note certain facts about the standard 
deviation. If we assume the distribution to be normal, the 
facts shown by Fig. 16 on page 426 hold. 

Between the mean and either plus- or minus-one standard
deviation, there are 34.13 per cent or slightly more than
one third of the total cases in a normal distribution. For
distributions which are roughly normal the same figures
hold approximately. Beyond either plus- or minus-one standard
deviation there must be about one sixth of the total number
of cases. Between plus-one sigma and minus-one sigma
there are roughly two-thirds of the cases. Although these
facts hold, strictly speaking, only for normal distributions,
most distributions of educational abilities for a random
population will be found to be sufficiently normal to permit
treatment as such. We can check these statements in a
rough way for the two sets of marks given in Table 78.

Fig. 16. The standard deviation and the normal curve.



                                    X                     Y
                                (Exam. A)             (Exam. B)

Mean                              77.2                  63.9
Standard Deviation                12.9                  14.5
+1σ                         90.1 (77.2+12.9)      78.4 (63.9+14.5)
-1σ                         64.3 (77.2-12.9)      49.4 (63.9-14.5)
Cases between +1σ and -1σ   13 (by actual count)  15 (by actual count)


With a total population (N) of 22 cases, the most probable
number falling within the limits +1σ and -1σ is about 15,
i.e., .6826 × 22 = 15.0. The actual cases included between
+1σ and -1σ were 13 and 15, respectively. It must be
evident that the percentages stated above will hold but very
roughly when N is small, as in this case.
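
The actual counts can be verified with a brief sketch using the marks of the twenty-two pupils. The function name is illustrative, and the standard deviation is computed with N (not N - 1) in the denominator, as in Formula (1).

```python
from math import sqrt

EXAM_A = [90, 88, 72, 53, 62, 64, 86, 57, 65, 87, 87, 91, 83,
          61, 78, 67, 91, 91, 94, 75, 90, 67]
EXAM_B = [79, 69, 75, 47, 62, 67, 72, 40, 47, 62, 90, 78, 63,
          32, 59, 71, 87, 53, 78, 51, 66, 58]


def cases_within_one_sigma(scores):
    """Count the scores lying between M - sigma and M + sigma."""
    n = len(scores)
    mean = sum(scores) / n
    sigma = sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return sum(1 for s in scores if mean - sigma <= s <= mean + sigma)


count_a = cases_within_one_sigma(EXAM_A)
count_b = cases_within_one_sigma(EXAM_B)
expected = 0.6826 * len(EXAM_A)  # most probable number for a normal curve
```

The counts are 13 and 15 against an expectation of about 15, which is as close an agreement as can be looked for with so few cases.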

Before closing the discussion of the standard deviation
as a measure of variability, we should refer to a statistical
measure which is commonly derived from the standard
deviation. Ease of thinking about distributions would be
fostered by having a measure of variability which would
include between the mean and that measure exactly twenty-
five per cent or one fourth of the cases in a normal distribu-
tion. Such a measure would divide the total distribution
into quarters, fractions which are even more exact and
simpler than those yielded by the standard deviation.
Such a measure is the probable error. It is defined and cal-
culated by means of the following formula: PE = .6745σ.

Fig. 17. The probable error and the normal curve.

For Examinations A and B of Table 78, the probable 
errors are: 

$PE_x$ = .6745 × 12.9 = 8.7 (approximately 9 points)

$PE_y$ = .6745 × 14.5 = 9.8 (approximately 10 points)

It will be noted that the range between +1PE and -1PE
does not include even approximately one-half of the cases.
This is due to two causes: (1) the small number of cases,
and (2) absence of normality in the distributions.

We shall make use of the probable error (PE) in a later
section. It is sufficient to remember that it is nothing more
nor less than the standard deviation multiplied by .6745 in
order to fraction the total distribution of a normal curve
into exact quarters for the sake of ease of thinking. Fig. 17
illustrates the probable error.
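
The multiplier .6745 is simply the point on the normal curve that cuts off one fourth of the cases above the mean. The sketch below checks this with Python's standard library (`statistics.NormalDist`, available from Python 3.8). The variable names are illustrative.

```python
from statistics import NormalDist

unit_normal = NormalDist()  # mean 0, standard deviation 1

# The deviate below which 75 per cent of the cases fall, i.e. 25 per cent
# of the cases lie between the mean and this point.
pe_multiplier = unit_normal.inv_cdf(0.75)

# Proportion of cases between -1 PE and +1 PE in a normal distribution.
middle_share = unit_normal.cdf(pe_multiplier) - unit_normal.cdf(-pe_multiplier)

pe_exam_a = 0.6745 * 12.9  # probable error for Examination A
pe_exam_b = 0.6745 * 14.5  # probable error for Examination B
```

The multiplier agrees with .6745 to four decimal places, and the middle share is one half, so that the limits ±1 PE together with the mean divide a normal distribution into quarters.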

PROBLEM 5 

The Accuracy of an Individual Score 

Having in mind the meaning of such statistical measures
as the arithmetic mean, the standard deviation, the probable
error, and the coefficient of correlation, we have the tools for
attacking the very pertinent problem of the degree of
confidence which may be placed in an individual score or
mark. This problem is often stated as that of finding the
“probable error of a score,” not to be confused with the
probable error of the distribution as already defined.

We are now familiar with one method of approach to this
question, viz., the evaluation of a test through the calculation
of its reliability coefficient, as in Table 78. It will be re-
called that the same twenty-two pupils were given two
supposedly equally good examinations. The correlation of
the two sets of marks was 0.67, a low figure. This coefficient
of correlation (here a reliability coefficient) shows that neither
examination is highly dependable. It does not tell us directly
how much error is likely to be present in the mark of any
particular pupil. Pupil No. 1 earned 90 on Examination A
and 79 on Examination B. Which is correct, or rather,
which is more nearly correct? This pupil cannot very well
be both a “90” and a “79” pupil in any absolute sense. One
fact is clear from our past calculations, viz., that Examina-
tion B was about thirteen points harder on the average than
was Examination A.

Turn to Table 78 and try subtracting thirteen points from
each mark in the X column. Compare these differences
with the values in the Y column. Do the disagreements then
disappear? By no means. In addition to the irregularities
caused by the inequality of difficulty of the two examinations,
there are other differences to be reckoned with. These
differences are due to various sorts of unreliability: sam-
pling, subjectivity, etc.

Were the results from Table 78 to be used for grading 
pupils by some ranking process, the average difference of 
13.3 points (between means) would have small significance. 
If these marks are taken to be per cents or values on the 
scale of 100, the systematic difference of 13.3 points is 
serious. The fluctuating differences, not corrected by sub- 
tracting 13.3 from the marks on Examination A, would 
disturb either form of marking (ranks or per cents); the 
difference of 13.3 on the average would disturb only when 
per cents or some equivalent system are employed. We can 
see these situations more clearly from Table 83 on page 431, 
which presents further facts about the two sets of marks 
given in Table 78. 

In studying Table 83 it is necessary to keep in mind that
the differences in the X-Y column are of two kinds: (a)
those arising from the fact that Examination A was 13.3
points easier than Examination B on the average, and (b)
those arising from unreliability proper. The mean differ-
ence (13.3) agrees, of course, with the difference previously
reported in the calculations based on Table 78, where $M_x$
was found to be 77.2 and $M_y$ to be 63.9. The standard
deviation of the differences proved to be 11.3, which, when
multiplied by .6745, gave 7.6 as the probable error of the
differences. The value 7.6 is independent of the influence
of the average difference of 13.3 points.¹

Table 83 yielded 8.0 as the standard deviation of a score
and 5.4 as the probable error of a score. As a matter of 
fact, we obtained these values in a most round-about manner, 
one which is seldom used in practice. This was done in- 
tentionally in order that something of the significance of 
fluctuations of scores of the same pupils from one test to 
another might be brought out. The method of Table 83 
should be regarded as a “long” method, chiefly of explana- 
tory interest to us. 

When the standard deviations (of the distributions) and 
the correlation of the two sets of marks or scores are known, 
as in the present instance, we can write formulas for the 
standard deviation or probable error of a score as follows: 

$$\sigma_{(\mathrm{Score})} = \frac{\sigma_x + \sigma_y}{2}\sqrt{1 - r_{xy}}$$

$$PE_{(\mathrm{Score})} = .6745\,\frac{\sigma_x + \sigma_y}{2}\sqrt{1 - r_{xy}}$$


¹The truth of this statement may not be apparent. The standard deviation of the 
differences was calculated by a formula not previously given, but, of course, algebraically 
equivalent to all four formulas given on pp. 423-4. If we take the differences of the X-Y 
column and regard them as a new set of X or raw score values (not to be confused with the 
marks from Examination A, previously denoted by X), the final column of Table 83 may 
be regarded as $X^2$ values. The formula for the standard deviation becomes in this case 

$$\sigma = \sqrt{\frac{\Sigma X^2}{N} - \left(\frac{\Sigma X}{N}\right)^2}$$

The second term under the radical is the same as $M_x^2$. The formula may therefore be 
rewritten 

$$\sigma = \sqrt{\frac{\Sigma X^2}{N} - M_x^2}$$

This formula differs only algebraically from the four others previously listed. It should 
be evident that the present formula differs from the others merely in guessing the average 
at zero instead of some value nearer the truth. If the reader will study Table 83 carefully, 
it will be seen that the present formula is an algebraic variation of the others previously 
given. 

To return to the original purpose of this footnote, it need only be pointed out that the 
13.3 (or average difference in difficulty of the two examinations) is removed by the 
second term under the radical, viz., $\left(\frac{\Sigma X}{N}\right)^2$. 

Because the two standard deviations are 
likely to differ somewhat, the formula calls for averaging 
the two. 

Substituting in the foregoing formulas, we obtain: 


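$$\sigma_{(\mathrm{Score})} = \frac{12.9 + 14.5}{2}\sqrt{1 - .67} = 7.9$$

$$PE_{(\mathrm{Score})} = .6745 \times \frac{12.9 + 14.5}{2}\sqrt{1 - .67} = 5.3$$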
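
The substitution above can be checked with a few lines of code. The following is only an illustrative sketch in Python; the function names `score_sd` and `score_pe` are assumptions introduced here, not terms from the text. It uses the constants already found for Table 78 (standard deviations 12.9 and 14.5, correlation .67).

```python
import math

def score_sd(sd_x, sd_y, r_xy):
    """Standard deviation of a score: mean of the two SDs times sqrt(1 - r)."""
    return (sd_x + sd_y) / 2 * math.sqrt(1 - r_xy)

def score_pe(sd_x, sd_y, r_xy):
    """Probable error of a score: .6745 times the standard deviation of a score."""
    return 0.6745 * score_sd(sd_x, sd_y, r_xy)

# Constants reported for Table 78 (hypothetical variable names).
sd = score_sd(12.9, 14.5, 0.67)
pe = score_pe(12.9, 14.5, 0.67)
print(round(sd, 1), round(pe, 1))
```

Rounded to one decimal, the two results come to 7.9 and 5.3.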


TABLE 83 

Calculation of the Probable Error of a Test Score 
(Data From Table 78) 

Pupil        X            Y         Differences    Squares of
         (Exam. A)    (Exam. B)       (X-Y)        Differences

  1         90           79            -11             121
  2         88           69            -19             361
  3         72           75              3               9
  4         53           47             -6              36
  5         62           62              0               0
  6         64           67              3               9
  7         86           72            -14             196
  8         57           40            -17             289
  9         65           47            -18             324
 10         87           62            -25             625
 11         87           90              3               9
 12         91           78            -13             169
 13         83           63            -20             400
 14         61           32            -29             841
 15         78           59            -19             361
 16         67           71              4              16
 17         91           87             -4              16
 18         91           53            -38            1444
 19         94           78            -16             256
 20         75           51            -24             576
 21         90           66            -24             576
 22         67           58             -9              81

Σ (Sums)                               -293            6715

$M_{(\mathrm{Differences})} = -293 \div 22 = -13.3$

$\sigma_{(\mathrm{Differences})} = \sqrt{\frac{6715}{22} - (-13.3)^2} = 11.3$

$PE_{(\mathrm{Differences})} = .6745 \times 11.3 = 7.6$

$\sigma_{(\mathrm{Score})} = .707\,\sigma_{(\mathrm{Differences})} = .707 \times 11.3 = 7.99 = 8.0$

$PE_{(\mathrm{Score})} = .6745\,\sigma_{(\mathrm{Score})} = .6745 \times 8.0 = 5.4$

The values given in the formulas at the top of page 431 
differ but one point in the first decimal place from those 
calculated in Table 83 by the “long” method. The two 
methods may disagree slightly in a particular instance, for 
reasons which will not be discussed here. 
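
The “long” method of Table 83 can be retraced step by step. The sketch below is an illustration in Python only; the variable names are assumptions made for this example. The differences are taken as Y minus X so that their signs match those tabulated; the sign has no effect on the standard deviation.

```python
import math

# (X, Y) marks for the 22 pupils, copied from Table 83.
marks = [(90, 79), (88, 69), (72, 75), (53, 47), (62, 62), (64, 67),
         (86, 72), (57, 40), (65, 47), (87, 62), (87, 90), (91, 78),
         (83, 63), (61, 32), (78, 59), (67, 71), (91, 87), (91, 53),
         (94, 78), (75, 51), (90, 66), (67, 58)]

n = len(marks)
diffs = [y - x for x, y in marks]        # signs as tabulated
sum_d = sum(diffs)                       # sum of the differences
sum_d2 = sum(d * d for d in diffs)       # sum of the squared differences

mean_d = sum_d / n
sd_d = math.sqrt(sum_d2 / n - mean_d ** 2)   # SD of the differences
pe_d = 0.6745 * sd_d                         # PE of the differences
sd_score = sd_d / math.sqrt(2)               # the .707 factor of the table
pe_score = 0.6745 * sd_score                 # PE of a single score

print(sum_d, sum_d2)
print(round(mean_d, 1), round(sd_d, 1), round(pe_d, 1))
print(round(sd_score, 1), round(pe_score, 1))
```

Subtracting the squared mean inside the radical is what frees the result from the average difference in difficulty between the two examinations.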

We will consider the probable error of a score to be about 
5.3 points.² The next task before us is to understand the 
meaning of such probable errors. To do this, consider the 
errors in a series of test scores for a large number of pupils 
to be distributed in a normal fashion. If the actual errors 
were plotted as a normal distribution, the probable error 
of that distribution would be 5.3 points. This again means 
that in fifty per cent of the cases the errors would fall between 
-1 PE and +1 PE, i.e., within a range of 5.3 + 5.3 or 10.6 
points. In other words, it is an even break or chance that 
any score lies within a band of 10.6 points centered on the 
correct score, were the latter obtainable. If a pupil took a 
very large number of tests similar to Examinations A and B, 
the average of all his marks would give his true score (if we 
neglect practice effects). The chances are therefore 50:50 
that an obtained score is within 5.3 either way of the true score. 

A probable error of 5.3 points is rather disconcertingly 
large when we recall that half of the pupils will receive scores 
with errors larger than this value. It can be shown that the 
following statements are true: 

The chances are one to one that an obtained score is in 
error by not more than 5.3 points either up or down. 

The chances are four to one that an obtained score is in 
error by not more than 10.6 (2 PE) points either up or down. 

²The formula used here is not, theoretically, the best one to employ in case a highly 
accurate determination of the probable error of a score is needed. It ignores for one thing 
the fact that scores in different parts of the total range do not have equal possibilities for 
error. Very high scores are somewhat more likely to be in error upwards, i.e., to be too 
high. The converse is also true. This is the regression effect already mentioned. A score 


The probable error formula adopted here may be thought of as a sort of “average probable 
error of a score.” For further information on this point see T. L. Kelley, Statistical Method 
(Macmillan, 1923), pp. 214ff; A. S. Otis, Statistical Method in Educational Measurement 
(World Book Co., 1925), pp. 247ff; and G. M. Ruch, “Minimum Essentials in Reporting 
Data on Standard Tests,” Journal of Educational Research, Vol. XII (1925), pp. 349-358. 





The chances are twenty to one that an obtained score is in 
error by not more than 15.9 (3 PE) points either up or down. 
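
The three statements of odds follow from the normal curve. As a rough check, and assuming (as the text does) that the errors are normally distributed, the sketch below computes the chance that an error lies within k probable errors. It is an illustration in Python; the names `chance_within` and `odds_in_favor` are assumptions made for this example.

```python
import math

PE_IN_SD_UNITS = 0.6745  # one probable error, expressed in standard deviations

def chance_within(k):
    """Probability that a normally distributed error lies within k probable errors."""
    z = k * PE_IN_SD_UNITS
    return math.erf(z / math.sqrt(2))

def odds_in_favor(k):
    """Odds (to one) that the error lies within k probable errors."""
    p = chance_within(k)
    return p / (1 - p)

for k in (1, 2, 3):
    print(k, round(chance_within(k), 3), round(odds_in_favor(k), 1))
```

On this assumption the odds come out at about 1 to 1, somewhat over 4 to 1, and somewhat over 20 to 1, so the round figures quoted in the statements are, if anything, slightly cautious.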

The statements just given show very clearly that with a 
test of a reliability in the neighborhood of .65, there is really 
no assurance that a pupil earning 75 is really not a “65” or 
an “85” pupil. If his mark was 75, it is an even chance that 
he deserves a mark between 70 and 80. The chances are 
about four to one that he deserves a mark between 65 and 85. 
There is about one chance in twenty that he really should 
receive a mark as low as 60 or as high as 90. Since much 
evidence has been gathered by Monroe, the author, and 
many others to the effect that the usual examination falls 
very close to the value 0.65 in reliability, our present findings 
for the data of Tables 78 and 83 can hardly exaggerate the 
facts. To attempt to classify pupils in as small a number of 
groups as five (e.g., A, B, C, D, and E) would be “shaky 
business” with such a test, as a little figuring will show. The 
range of marks on Examination A of Table 78 was from 53 
to 94, or forty-one points. For Examination B the range was 
from 32 to 90, or fifty-eight points. The average range was 
therefore not far from fifty points. If the PE(Score) is five 
points and the range is fifty points, the following 
statements hold approximately: 

The chances are one to one that an obtained score is in 
error by no more than one-tenth of the range either up or down. 

The chances are four to one that an obtained score is in 
error by no more than two-tenths of the range either up or down. 

The chances are twenty to one that an obtained score is 
in error by no more than three-tenths of the range either up or down. 

PROBLEM 6 

What is a Satisfactory Degree of Reliability? 

The question of what constitutes a satisfactory degree of 
reliability cannot be answered in other than relative terms. 
Assuming that the teacher’s usual interest in reliability 
coefficients centers about the reliability of a test given in a 
single class or grade, the following statements can probably 
be defended:¹ 


Reliability
Coefficient     Interpretation or Significance

.95 to .99      Very high; rarely obtained except with long, carefully
                standardized tests. Long objective tests occasionally
                reach this level.

.90 to .94      High; about the limit of teacher-made tests of (say)
                100 to 200 items.

.80 to .89      Fairly high; usually obtainable with well-constructed
                objective tests of 75 to 150 items. Relatively few
                essay-type examinations reach this level.

.70 to .79      Rather low; not of much value for purposes of evaluating
                individual pupils.

Below .70       Low; almost valueless except for averages of classes.
                The average essay-type examination does not exceed
                .70 and is possibly lower.


The foregoing statements must be taken with due caution, 
since there are many factors to be considered. Many will 
think the interpretations presented to be very conservative. 
Evidence to be brought forward later will show that the 
reverse is probably true. The older textbooks on statistical 
methods have often tended to create the impression that 
correlations above .50 or .60 were at least “moderately 
high,” and that those above .75 to .80 were “very high.” 
This view is possibly partly to be explained upon the basis 
of a tendency to regard coefficients of correlation as per 
cents. Nothing could be farther from the truth. 

The reader may wonder whether there is any basis at all 
for answering the question, “When is a correlation high?” 

¹The confining of the discussion to a single grade or class was done intentionally. It 
is a well-known fact that the same test will show very different reliability coefficients in 
a “narrow” group like a single class in comparison with a “wide” group like one composed 
of a half dozen school grades pooled. We sometimes say that the correlation is dependent 
upon the range of talent or heterogeneity of the group. Kelley has given us a formula for 
inferring correlation for a wide range from a known value on a small range, or vice versa. 
Brief discussion of this point will be made later in the chapter. 

Correlations are employed for two general, although not 
distinctly different, purposes: (a) as measures of relationship, 
and (b) for prediction. It is with the latter use that 
we need to deal in approaching a basis for interpreting 
correlation coefficients. We may as well return to the data 
of Table 78 for our illustration. We have previously ob- 
tained the following statistical constants: 

$M_x$      $M_y$      $SD_x$     $SD_y$     $r_{xy}$

77.2       63.9       12.9       14.5       .67 

In Table 78 we have given two sets of scores or marks as 
actually obtained on Examinations A and B. We can think 
of these as experimental or actual values. (They must not 
be regarded as true values.) To the degree that these two 
sets of scores are correlated, it is possible to predict one set 
of scores from the other. 

If the correlation between the two sets of scores had been 
perfect (1.00 instead of .67), there would be no error in 
predicting the scores. If the correlation had been zero, the 
errors of prediction would be a maximum. 

The actual formulas used for prediction are: 

$$\bar{X} = r_{xy}\,\frac{\sigma_x}{\sigma_y}\,(Y - M_y) + M_x$$

$$\bar{Y} = r_{xy}\,\frac{\sigma_y}{\sigma_x}\,(X - M_x) + M_y$$

The bars above the X and Y are used to indicate that these 
are estimated or predicted values, not the obtained or actual 
experimental values. 

If we substitute the statistical constants found for Table 
78, we obtain the following: 

$$\bar{X} = .67\,\frac{12.9}{14.5}\,(Y - 63.9) + 77.2, \quad\text{or}\quad \bar{X} = .60Y + 39.1$$

$$\bar{Y} = .67\,\frac{14.5}{12.9}\,(X - 77.2) + 63.9, \quad\text{or}\quad \bar{Y} = .75X + 5.8$$

These are called the regression equations. It should be 
noted that there are always two such regression equations, 
and that they are not interchangeable. 
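
The two regression equations can be derived mechanically from the five constants. The following sketch, in Python, is offered only as an illustration; the helper name `regression_line` and its argument names are assumptions made for this example.

```python
def regression_line(r, sd_pred, sd_given, mean_pred, mean_given):
    """Return (slope, intercept) for predicting one variable from the other.

    slope     = r * (SD of predicted variable) / (SD of given variable)
    intercept = (mean of predicted variable) - slope * (mean of given variable)
    """
    slope = r * sd_pred / sd_given
    intercept = mean_pred - slope * mean_given
    return slope, intercept

# Constants found for Table 78.
m_x, m_y, sd_x, sd_y, r_xy = 77.2, 63.9, 12.9, 14.5, 0.67

bx, ax = regression_line(r_xy, sd_x, sd_y, m_x, m_y)   # X predicted from Y
by, ay = regression_line(r_xy, sd_y, sd_x, m_y, m_x)   # Y predicted from X

print(round(bx, 2), round(ax, 1))
print(round(by, 2), round(ay, 1))
```

The product of the two slopes equals the square of r, which is one way of seeing that the two lines coincide only when the correlation is perfect.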

There are a great many educational situations where 
prediction by means of regression lines is possible and ad- 
vantageous, but it must always be remembered that predict- 
ing scores or other values is inferior to actually obtaining 
such experimentally. Thus, it might have happened that, in 
addition to the twenty-two pupils of Table 78, certain addi- 
tional pupils were absent and missed one or the other of 
the examinations. Instead of giving a later examination, 
the missing marks might have been predicted by the fore- 
going equations. It does not follow that the predicted mark 
would have been exactly the same one which the pupil 
would actually have earned. The higher the correlation, the 
safer the prediction method. Teachers often administer 
prognosis or prediction tests for purposes of sectioning 
classes or counseling pupils. Such tests have previously 
been proved to have such predictive powers. They are 
successful in practice only to the degree that their scores 
correlate with future success. Regression equations are 
often used to establish the best basis for such predictions. 

Purely as a definition and illustration of prediction, we 
shall apply the regression equation to the values in the X 
column of Table 78; i.e., we shall predict the Y values which 
correspond to obtained or actual X values. There is no 
practical reason for doing this, except for illustration, since 
the actual Y values are known. However, comparing Y 
and $\bar{Y}$ values will give us a very good basis for arriving at 
some notions about the predictive value of a correlation of 
.67. Table 84 shows the results. The X column repeats 
the X column of Table 78. The Y column repeats the Y 
column of Table 78. The $\bar{Y}$ column shows the $\bar{Y}$ (predicted) 
scores for each X score in turn; the predicted scores being 
obtained by substituting each successive X value in the 
equation, $\bar{Y} = .75X + 5.8$. The $\bar{Y} - Y$ column shows the 
differences between the actual Y and the predicted $\bar{Y}$ 
values. The $(\bar{Y} - Y)^2$ column shows the squares of these 
differences. 

TABLE 84 

Long Method of Calculating the Standard Error of Estimate 

    X              Y           $\bar{Y}$                  $\bar{Y} - Y$      $(\bar{Y} - Y)^2$
 (Actual        (Actual      (Estimated scores      (Errors of     (Squared
 scores on      scores on    on Exam. B by the      estimate)      errors of
 Exam. A)       Exam. B)     equation,                             estimate)
                             $\bar{Y} = .75X + 5.8$)

   90             79            73.3                  -5.7           32.49
   88             69            71.8                   2.8            7.84
   72             75            59.8                 -15.2          231.04
   53             47            45.5                  -1.5            2.25
   62             62            52.3                  -9.7           94.09
   64             67            53.8                 -13.2          174.24
   86             72            70.3                  -1.7            2.89
   57             40            48.5                   8.5           72.25
   65             47            54.5                   7.5           56.25
   87             62            71.0                   9.0           81.00
   87             90            71.0                 -19.0          361.00
   91             78            74.0                  -4.0           16.00
   83             63            68.0                   5.0           25.00
   61             32            51.5                  19.5          380.25
   78             59            64.3                   5.3           28.09
   67             71            56.0                 -15.0          225.00
   91             87            74.0                 -13.0          169.00
   91             53            74.0                  21.0          441.00
   94             78            76.3                  -1.7            2.89
   75             51            62.0                  11.0          121.00
   90             66            73.3                   7.3           53.29
   67             58            56.0                  -2.0            4.00

                                                      Σ =          2580.86

$\sigma_{y.x} = \sqrt{\frac{2580.86}{22}} = 10.8$ (Standard Error of Estimate) 

The $\bar{Y} - Y$ values are termed the errors of estimate. By 
squaring these errors of estimate, taking their sum, dividing 
this sum by N, and finally extracting the square root, we 
obtain the standard error of estimate. A more descriptive, 
but less common, name would be the “standard deviation of 
the errors of estimate” (or prediction). The symbol, $\sigma_{y.x}$, 
is used to indicate the standard error of estimate (of the 
Y’s from the X’s). Note the calculation of the value of 
$\sigma_{y.x}$ at the bottom of Table 84. 

In order to illustrate the method of obtaining the $\bar{Y}$ values 
of Table 84, the actual solutions are given below for the first 
three pupils. 

Pupil    X Score    Substitution in Equation           $\bar{Y}$, or Estimated Score

  1        90       $\bar{Y} = (.75 \times 90) + 5.8$            73.3
  2        88       $\bar{Y} = (.75 \times 88) + 5.8$            71.8
  3        72       $\bar{Y} = (.75 \times 72) + 5.8$            59.8



Returning to the solution for the standard error of estimate 
at the bottom of Table 84, we can write the formula as 
follows: 

$$\sigma_{y.x} = \sqrt{\frac{\Sigma(\bar{Y} - Y)^2}{N}}$$

It was suggested previously that this “long method” of 
finding the standard error of estimate was chosen merely for 
purposes of definition of such concepts as estimated scores, 
errors of estimate, etc. In actual practice it is quite unnecessary 
to go through the laborious calculations of Table 84. 
The standard error of estimate is given directly (for the estimation 
of Y values from X values) by the following: 

$$\sigma_{y.x} = \sigma_y\sqrt{1 - r_{xy}^2}$$

Similarly, the standard error of estimating X values from 
Y values is: 

$$\sigma_{x.y} = \sigma_x\sqrt{1 - r_{xy}^2}$$

If we substitute in the formula, $\sigma_{y.x} = \sigma_y\sqrt{1 - r_{xy}^2}$, the 
actual values that we found previously for Table 78, we 
obtain $\sigma_{y.x} = 14.5\sqrt{1 - (.67)^2} = 10.7$ as the value of the 
standard error of estimate. This agrees closely with the 
result by the “long” method. 
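
The agreement between the “long” and the direct methods can be checked as follows. This is a sketch in Python, given only as an illustration; the variable names are assumptions made for this example. The predicted values are left unrounded here, so the sum of squared errors differs slightly from the tabled figure, and the rounded constants (.75, 5.8, .67, 14.5) leave a small gap between the two results.

```python
import math

# Actual marks on Examinations A (x) and B (y), copied from Table 84.
x = [90, 88, 72, 53, 62, 64, 86, 57, 65, 87, 87,
     91, 83, 61, 78, 67, 91, 91, 94, 75, 90, 67]
y = [79, 69, 75, 47, 62, 67, 72, 40, 47, 62, 90,
     78, 63, 32, 59, 71, 87, 53, 78, 51, 66, 58]

# Long method: predict each Y, take the errors, square, average, take the root.
predicted = [0.75 * xi + 5.8 for xi in x]
errors = [p - yi for p, yi in zip(predicted, y)]
see_long = math.sqrt(sum(e * e for e in errors) / len(errors))

# Direct method: SD of the Y values times sqrt(1 - r squared).
sd_y, r_xy = 14.5, 0.67
see_short = sd_y * math.sqrt(1 - r_xy ** 2)

print(round(see_long, 2), round(see_short, 2))
```

Both routes give a value between 10.7 and 10.9.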

We are now ready to apply what we have learned about 
the standard error of estimate to our original problem of 
finding a basis of interpreting the meaning of correlation. 
In the formula, $\sigma_{y.x} = \sigma_y\sqrt{1 - r_{xy}^2}$, r must take values 
between zero and 1.00 (unity). The sign of r can be either 
positive or negative, but this does not affect the degree of 
prediction. If r is 1.00, the radical expression reduces to 
zero, and hence the value of the standard error of estimate 
($\sigma_{y.x}$) is zero. The prediction is perfect, since there is no 
error of prediction when r equals 1.00. Now note the other 
limiting value of r, viz., zero. When r is zero, the radical 
expression reduces to 1.00, and $\sigma_{y.x}$ equals $\sigma_y$. This shows 
that the standard error of estimate is exactly the same size 
as the standard deviation of the Y values when r is zero. 
There would be no guarantee at all that a pupil earning a 
very high score on Examination A (say, 95) would earn a 
high value on Examination B. In the absence of correlation, 
any value on Examination B would be as probable as any 
other. Since with zero correlation the errors of estimate are 
limited only by the variability of the Y scores, and no value 
of Y is more probable than any other for any given value of 
X, it may be said that zero correlation, if used for purposes 
of prediction, results in accuracy no better than chance.¹ 

In our particular problem we found that with a correlation 
of .67 the standard error of estimate was 10.7, in comparison 
with the standard deviation of 14.5 for the Y values. 
The ratio of $\sigma_{y.x}$ to $\sigma_y$ is .74; or in other words, the variability of 
our errors of estimate showed a spread roughly three-fourths 
as great as the scores being estimated. In a certain 
sense, therefore, the correlation of .67 is but one-fourth of 
the way (in accuracy of prediction) from zero correlation to 
perfect correlation. 

¹T. L. Kelley, Statistical Method (New York: The Macmillan Company, 1923), 
pp. 172-4. 

Since the value of the standard error of estimate is always 
some fraction of that of the standard deviation, the fraction 
being given by the radical expression $\sqrt{1 - r^2}$, it is possible 
to build up a simple table of great value in interpreting the 
significance of r for prediction. Kelley has termed this 
radical expression the coefficient of alienation (referring to the 
alienation or departure of any given value of r from 1.00 or 
perfect prediction). 

Table 85 shows the coefficients of alienation and the per 
cent of reduction in the standard error of estimate for 
many values of r. 

TABLE 85 

Coefficients of Alienation and the Reduction in the Standard 
Errors of Estimates for Various Values of r 

   (a)               (b)                         (c)
    r          $\sqrt{1 - r^2}$            Per Cent of Reduction
           Coefficient of Alienation       in Standard Error
                                           of Estimate

   .00            1.000                        0.0
   .10             .995                        0.5
   .20             .980                        2.0
   .30             .954                        4.6
   .40             .917                        8.3
   .50             .866                       13.4
   .60             .800                       20.0
   .70             .714                       28.6
   .80             .600                       40.0
   .866+           .500                       50.0
   .90             .436                       56.4
   .95             .312                       68.8
   .96             .280                       72.0
   .97             .243                       75.7
   .98             .199                       80.1
   .99             .141                       85.9
  1.00             .000                      100.0
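
The entries of Table 85 can be regenerated for any value of r. The sketch below is an illustration in Python; the function names `alienation` and `percent_reduction` are assumptions made for this example.

```python
import math

def alienation(r):
    """Coefficient of alienation: the radical expression sqrt(1 - r^2)."""
    return math.sqrt(1 - r * r)

def percent_reduction(r):
    """Per cent by which the standard error of estimate falls below the SD."""
    return 100 * (1 - alienation(r))

for r in (0.00, 0.50, 0.80, 0.90, 0.95, 0.99, 1.00):
    print(f"{r:.2f}   {alienation(r):.3f}   {percent_reduction(r):5.1f}")
```

Trying intermediate values of r in the same way shows how slowly the reduction grows until r is quite large.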


Table 85 gives us some rather definite ideas about the 
accuracy of prediction possible with varying values of the 
correlation coefficient. For example, it takes a value of r 
equal to about .866 to reduce the standard error of estimate 
to .50 (half of the standard deviation). When r is .95, 
column (b) shows that the alienation from perfect prediction 
is still almost one-third. In general it should be noted that 
the rise in accuracy of prediction is very slow for small 
values of r (.00 to .50 or .60); the rise shows positive acceleration, 
becoming very rapid only after .95 is passed. Even a 
correlation of .99 shows an alienation of about one-seventh 
from perfect prediction. 

If the reader will now refer to the introductory statements 
about what constitutes high correlation, it will be granted 
that those statements were conservative indeed in the light 
of the mathematics of prediction. 

PROBLEM 7 

The Effect of Heterogeneity or Range of Talent on 
Reliability Coefficients 

There is one additional concept which the student needs 
in considering the significance of correlations, particularly 
reliability coefficients. We can introduce the discussion by 
saying: When correlation exists between two sets of variables 
(scores, marks, etc.), the amount of correlation is a function 
(result) of the variability of the group sampled. This will be 
made clearer by the following example: 

A teacher prepared an objective test in United States 
history. She gave this to her classes in the sixth, seventh, 
and eighth grades. She computed the reliability coefficients 
for each grade separately and for the three grades pooled. 
For the eighth grade alone, r was found to be .77; for the 
three grades pooled it was .91. This appears to be a contra- 
diction. As a matter of fact, it is the expectancy. The more 
grades pooled, i.e., the greater the range of talent or abilities, 
the larger the expected correlation, other factors being the 
same. 

In order to test out whether these two values of r, .77 and 
.91, are contradictory, we need certain additional facts, 
principally some measures of the relative variabilities of the 
two groups. The standard deviations are again best for our 
purposes. 

                              Reliability       Standard
Grade Range Tested            Coefficient       Deviation

8th grade only                  .77 (r)         10.2 ($\sigma$)

6th, 7th, and 8th grades        .91 (R)         15.3 ($\Sigma$)

Note that certain letters have been inserted parentheti- 
cally after the values for the reliability coefficients and the 
standard deviations. These letters refer to the following 
formula: 

$$\frac{\sigma}{\Sigma} = \frac{\sqrt{1 - R}}{\sqrt{1 - r}}$$

Small letters refer to the statistical constants for the narrow 
range of talent (8th grade), and large letters indicate 
corresponding values for the broader range of talent (three 
grades pooled). It should be noted that $\Sigma$ here means the 
standard deviation of the wider range of talent, 15.3, and 
not summation, as we have previously used it. 

If we substitute for r, $\sigma$, and $\Sigma$ the actual values obtained, 
and then solve the expression, we find a predicted value for 
R, thus: 

$$\frac{10.2}{15.3} = \frac{\sqrt{1 - R}}{\sqrt{1 - .77}}, \quad\text{from which } R = .90.$$

This means that the increase in the variability of the group 
tested from a standard deviation of 10.2 to a standard 
deviation of 15.3 should increase the reliability from .77 to 
about .90, provided that the test functions equally reliably 
throughout such a range of three grades. Since the .90 
agrees very well with the actual value, .91, it may be concluded 
that the two sets of results are quite in harmony, the 
apparent discrepancy being explainable upon a basis of the 
difference in the range of abilities normally found in a single 
grade and those found over a range of two or more grades. 
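
Solving the formula for R gives $R = 1 - (1 - r)(\sigma/\Sigma)^2$, which is easily put into a small routine. The sketch below is an illustration in Python; the function name `reliability_on_wider_range` and its argument names are assumptions made for this example, and it presumes, as the text does, that the test functions equally reliably throughout the wider range.

```python
def reliability_on_wider_range(r, sd_narrow, sd_wide):
    """Solve sigma/Sigma = sqrt(1 - R) / sqrt(1 - r) for R.

    r         -- reliability on the narrow range
    sd_narrow -- standard deviation of the narrow group (sigma)
    sd_wide   -- standard deviation of the wider group (Sigma)
    """
    return 1 - (1 - r) * (sd_narrow / sd_wide) ** 2

# History test: eighth grade alone versus three grades pooled.
R = reliability_on_wider_range(0.77, 10.2, 15.3)
print(round(R, 2))
```

When the two standard deviations are equal the routine returns r unchanged, as it should.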

A reliability coefficient is thus seen to be a function of the 
standard deviation. For this reason the statement that a 
certain test has a reliability of .65 or .89 is quite meaningless 
unless additional facts are given. In addition, some measure 
of heterogeneity or range of talent is needed. In rough work 
it is often meaningful to express range of talent in terms of 
the grades or ages of pupils tested. But, as our formula 
implies, the best measure of heterogeneity is the standard 
deviation. It follows that a reliability coefficient should be 
accompanied by a statement of the standard deviation. 
This entails no extra work, as we have seen, since the stan- 
dard deviation is given as a by-product of the computation 
of the reliability coefficient. 

The reader should be reminded at this point of the dis- 
cussion under Problem 5, The Accuracy of an Individual 
Score. A formula was given for the probable error of a 
score, as follows: 

$$PE_{(\mathrm{Score})} = .6745\,\frac{\sigma_x + \sigma_y}{2}\sqrt{1 - r_{xy}}$$

It should be noted that this formula includes both the 
standard deviation and the reliability coefficient. If the 
range of talent is increased (e.g., several grades instead of 
one), the standard deviations become larger, but the re- 
liability coefficient also increases in size. If all other factors 
are constant, these two measures vary together. The 
PE(Score) is therefore roughly constant. For this reason it 
is usually more meaningful than the reliability coefficient 
itself. 

If we compute the PE(Score) for both narrow and wide 
ranges of talent for the history test we have been using as an 
example, we obtain: 

(Narrow Range) $PE_{(\mathrm{Score})} = (.6745)(10.2)\sqrt{1 - .77} = 3.3$

(Wider Range) $PE_{(\mathrm{Score})} = (.6745)(15.3)\sqrt{1 - .91} = 3.1$

These probable errors of score are practically constant, 
although the correlation was raised from .77 to .91 by pooling 
three grades. The slight disagreement is due to the fact that 
both the .77 and the .91 contain errors of sampling, since 
both are based upon rather small numbers of cases.¹ 
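
The near-constancy of the probable error of a score can be verified directly. The sketch below is an illustration in Python; the function name `probable_error_of_score` is an assumption made for this example, and a single standard deviation is used for each group rather than the average of two.

```python
import math

def probable_error_of_score(sd, r):
    """PE of a score from one standard deviation and the reliability coefficient."""
    return 0.6745 * sd * math.sqrt(1 - r)

pe_narrow = probable_error_of_score(10.2, 0.77)   # eighth grade only
pe_wide = probable_error_of_score(15.3, 0.91)     # three grades pooled

print(round(pe_narrow, 1), round(pe_wide, 1))
```

The standard deviation rises by half between the two groups, yet the two probable errors differ by only about two-tenths of a point.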

Table 86 will be found convenient for reading directly 
values of reliability coefficients (R) for a wider range of 
talent or heterogeneity when two facts are given: the 
reliability on a narrow range (r) and the ratio of the standard 
deviations for the two ranges. Simple interpolation will 
be satisfactory for intermediate values. 


¹In the formula for the PE(Score) as used here, the average of the two standard deviations 
is not taken since $\sigma_x$ is assumed to be equal to $\sigma_y$. (See Problem 5 of this chapter.) 